Not a gateway.
The full stack.
Inference, vector stores, file storage, safety, tool calling, and agentic orchestration in a single OpenAI-compatible server. Pluggable providers, any language, deploy anywhere.
Try it now. Nothing to install beyond uv:
```shell
uvx --from 'llama-stack[starter]' llama stack run starter
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
response = client.responses.create(
    model="llama-3.3-70b",
    input="Summarize this repository",
    tools=[{"type": "web_search"}],
)
```

Everything your AI app needs. One server.
More than inference routing. Llama Stack composes inference, storage, safety, and orchestration into a single process. Your agent can search a vector store, call a tool, check safety, and stream the response. No glue code. No sidecar services.
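That composition shows up as a single request. A minimal sketch of the payload such a request carries, assuming the Responses API shape from the quickstart above; the model name and vector store ID are placeholders:

```python
def build_composed_request(model: str, prompt: str, vector_store_id: str) -> dict:
    """Build one /v1/responses payload that searches a vector store,
    permits a web search tool call, and streams the answer back."""
    return {
        "model": model,
        "input": prompt,
        "stream": True,  # tokens arrive as they are generated
        "tools": [
            # retrieval over a server-managed vector store
            {"type": "file_search", "vector_store_ids": [vector_store_id]},
            # external tool call, executed server-side
            {"type": "web_search"},
        ],
    }

payload = build_composed_request("llama-3.3-70b", "What changed recently?", "vs_123")
```

The point is that retrieval, tool use, and streaming are fields on one request, not three services your application has to stitch together.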
Inference
- /v1/chat/completions: Chat Completions
- /v1/responses: Responses
- /v1/embeddings: Embeddings
- /v1/models: Models
- /v1/messages: Messages (Anthropic-compatible)

Data

- /v1/vector_stores: Vector Stores
- /v1/files: Files
- /v1/batches: Batches

Safety & Tools

- /v1/moderations: Moderations
- /v1/tools: Tools
- /v1/connectors: Connectors

A server, not a library
SDK abstractions couple your app to a specific language, release cycle, and import path. Llama Stack is an HTTP server. Your app talks to a standard API.
Write in Python, Go, TypeScript, or plain curl. Swap the server without touching application code. That's the difference between a library abstraction and a server abstraction.
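Concretely, no SDK import is needed at all: a standard HTTP POST is enough. A sketch using only the Python standard library, assuming the local server address from the quickstart above:

```python
import json
import urllib.request

def prepare_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Prepare a raw POST to /v1/responses with no SDK dependency."""
    body = json.dumps({"model": model, "input": prompt}).encode()
    return urllib.request.Request(
        url=f"{base_url}/responses",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer fake"},
        method="POST",
    )

req = prepare_request("http://localhost:8321/v1", "llama-3.3-70b", "Hello")
# With the server running, sending it is one line:
# with urllib.request.urlopen(req) as resp: print(resp.read())
```

Any language with an HTTP client can build the same request; nothing here is Python-specific.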
from sdk import ...   (coupled)
POST /v1/responses    (any language)

23 inference providers. 13 vector stores. 7 safety backends.
Develop locally with Ollama. Deploy to production with vLLM. Wrap Bedrock or Vertex without lock-in. Same API surface, different backend.
How it works
Your application talks to one server. That server routes to pluggable providers for inference, vector storage, files, safety, and tools. The composition happens at the server level, not in your application code.
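The routing idea can be illustrated with a toy table: each API group resolves to a pluggable provider behind the server. The provider names below are illustrative examples, not an exhaustive or authoritative list:

```python
# Toy sketch of server-side routing. The application only sees HTTP paths;
# the server decides which provider implementation handles each one.
ROUTES = {
    "/v1/chat/completions": "inference:ollama",
    "/v1/vector_stores": "vector_io:faiss",
    "/v1/files": "files:localfs",
    "/v1/moderations": "safety:llama-guard",
}

def route(path: str) -> str:
    """Resolve a request path to the provider that handles it."""
    for prefix, provider in ROUTES.items():
        if path.startswith(prefix):
            return provider
    raise KeyError(f"no provider registered for {path}")
```

Swapping a provider edits this server-side table; the paths your application calls never change.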
Open source
Apache 2.0 licensed. Contributions welcome.