Not a gateway.
The full stack.
Inference, vector stores, file storage, safety, tool calling, and agentic orchestration in a single OpenAI-compatible server. Pluggable providers, any language, deploy anywhere.
Try it now. Nothing to install beyond uv:
```shell
uvx --from 'llama-stack[starter]' llama stack run starter
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
response = client.responses.create(
    model="llama-3.3-70b",
    input="Summarize this repository",
    tools=[{"type": "web_search"}],
)
```

Everything your AI app needs. One server.
More than inference routing. Llama Stack composes inference, storage, safety, and orchestration into a single process. Your agent can search a vector store, call a tool, check safety, and stream the response. No glue code. No sidecar services.
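That composition shows up as a single request. A minimal sketch of the payload such a request carries, assuming the Responses API shape from the quickstart above; the model name and vector store ID are placeholders:

```python
def build_composed_request(model: str, prompt: str, vector_store_id: str) -> dict:
    """Build one /v1/responses payload that searches a vector store,
    permits a web search tool call, and streams the answer back."""
    return {
        "model": model,
        "input": prompt,
        "stream": True,  # tokens arrive as they are generated
        "tools": [
            # retrieval over a server-managed vector store
            {"type": "file_search", "vector_store_ids": [vector_store_id]},
            # external tool call, executed server-side
            {"type": "web_search"},
        ],
    }

payload = build_composed_request("llama-3.3-70b", "What changed recently?", "vs_123")
```

The point is that retrieval, tool use, and streaming are fields on one request, not three services your application has to stitch together.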
Inference
- /v1/chat/completions: Chat Completions
- /v1/responses: Responses
- /v1/embeddings: Embeddings
- /v1/models: Models
- /v1/messages: Messages (Anthropic-compatible)

Data

- /v1/vector_stores: Vector Stores
- /v1/files: Files
- /v1/batches: Batches

Safety & Tools

- /v1/moderations: Moderations
- /v1/tools: Tools
- /v1/connectors: Connectors

A server, not a library
SDK abstractions couple your app to a specific language, release cycle, and import path. Llama Stack is an HTTP server. Your app talks to a standard API.
Write in Python, Go, TypeScript, or plain curl. Swap the server without touching application code. That's the difference between a library abstraction and a server abstraction.
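Concretely, no SDK import is needed at all: a standard HTTP POST is enough. A sketch using only the Python standard library, assuming the local server address from the quickstart above:

```python
import json
import urllib.request

def prepare_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Prepare a raw POST to /v1/responses with no SDK dependency."""
    body = json.dumps({"model": model, "input": prompt}).encode()
    return urllib.request.Request(
        url=f"{base_url}/responses",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer fake"},
        method="POST",
    )

req = prepare_request("http://localhost:8321/v1", "llama-3.3-70b", "Hello")
# With the server running, sending it is one line:
# with urllib.request.urlopen(req) as resp: print(resp.read())
```

Any language with an HTTP client can build the same request; nothing here is Python-specific.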
from sdk import ...   (coupled)
POST /v1/responses    (any language)

23 inference providers. 13 vector stores. 7 safety backends.
Develop locally with Ollama. Deploy to production with vLLM. Wrap Bedrock or Vertex without lock-in. Same API surface, different backend.
How it works
Your application talks to one server. That server routes to pluggable providers for inference, vector storage, files, safety, and tools. The composition happens at the server level, not in your application code.
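The routing idea can be illustrated with a toy table: each API group resolves to a pluggable provider behind the server. The provider names below are illustrative examples, not an exhaustive or authoritative list:

```python
# Toy sketch of server-side routing. The application only sees HTTP paths;
# the server decides which provider implementation handles each one.
ROUTES = {
    "/v1/chat/completions": "inference:ollama",
    "/v1/vector_stores": "vector_io:faiss",
    "/v1/files": "files:localfs",
    "/v1/moderations": "safety:llama-guard",
}

def route(path: str) -> str:
    """Resolve a request path to the provider that handles it."""
    for prefix, provider in ROUTES.items():
        if path.startswith(prefix):
            return provider
    raise KeyError(f"no provider registered for {path}")
```

Swapping a provider edits this server-side table; the paths your application calls never change.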
Open source
Apache 2.0 licensed. Contributions welcome.