
Not a gateway.
The full stack.

Inference, vector stores, file storage, safety, tool calling, and agentic orchestration in a single OpenAI-compatible server. Pluggable providers, any language, deploy anywhere.

Try it now with a single command (requires uv)

uvx --from 'llama-stack[starter]' llama stack run starter
/v1/responses
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
response = client.responses.create(
    model="llama-3.3-70b",
    input="Summarize this repository",
    tools=[{"type": "web_search"}],
)

Everything your AI app needs. One server.

More than inference routing. Llama Stack composes inference, storage, safety, and orchestration into a single process. Your agent can search a vector store, call a tool, check safety, and stream the response. No glue code. No sidecar services.
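As a sketch of what one such agent turn looks like, the request body below asks the Responses API to consult a vector store and a web search tool in a single streamed call. The model name and vector store ID are placeholders, not fixed values; a real ID would come back from POST /v1/vector_stores.

```python
import json

# Hypothetical ID; a real one is returned when you create a vector store.
VECTOR_STORE_ID = "vs_example"

# One request body drives retrieval, tool calling, and streaming together;
# the server, not the application, coordinates the providers behind it.
payload = {
    "model": "llama-3.3-70b",
    "input": "What do our docs say about deployment?",
    "tools": [
        {"type": "file_search", "vector_store_ids": [VECTOR_STORE_ID]},
        {"type": "web_search"},
    ],
    "stream": True,
}

print(json.dumps(payload, indent=2))
```

Everything the agent needs for the turn travels in one request; there is no separate retrieval service to stand up or orchestrate from application code.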

Inference

/v1/chat/completions (Chat Completions)
/v1/responses (Responses)
/v1/embeddings (Embeddings)
/v1/models (Models)
/v1/messages (Messages, Anthropic-compatible)

Data

/v1/vector_stores (Vector Stores)
/v1/files (Files)
/v1/batches (Batches)

Safety & Tools

/v1/moderations (Moderations)
/v1/tools (Tools)
/v1/connectors (Connectors)
Full API reference
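The other endpoints follow the same OpenAI-compatible shapes. As an illustration (built but not sent, so no server is needed), here is a /v1/moderations request assembled with only the Python standard library; the input text and the "fake" API key mirror the example above and carry no special meaning:

```python
import json
import urllib.request

# Build, but do not send, a moderation request against a local server.
# The path mirrors OpenAI's /v1/moderations; the body is one input string.
req = urllib.request.Request(
    "http://localhost:8321/v1/moderations",
    data=json.dumps({"input": "Is this text safe to publish?"}).encode(),
    headers={"Content-Type": "application/json", "Authorization": "Bearer fake"},
    method="POST",
)

print(req.method, req.full_url)
```

With a running server, `urllib.request.urlopen(req)` would return the moderation verdict from whichever safety provider is configured.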

A server, not a library

SDK abstractions couple your app to a specific language, release cycle, and import path. Llama Stack is an HTTP server. Your app talks to a standard API.

Write your app in Python, Go, TypeScript, or plain curl. Swap the server without touching application code. That is the difference between a library abstraction and a server abstraction.
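To make the contract concrete: the entire interface between your app and the server is an HTTP exchange like the one spelled out below. Any language that can send these bytes can use the stack; the request body shown is a minimal sketch, not the full set of accepted fields.

```python
# The whole app-to-server contract is plain HTTP; no SDK import is required.
raw_request = (
    "POST /v1/responses HTTP/1.1\r\n"
    "Host: localhost:8321\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"model": "llama-3.3-70b", "input": "Summarize this repository"}'
)

print(raw_request)
```

A Python client, a Go client, and a shell script all reduce to this same wire format, which is why swapping languages or servers leaves the other side untouched.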

SDK library: from sdk import ... (coupled to one language)
Llama Stack: POST /v1/responses (any language)

23 inference providers. 13 vector stores. 7 safety backends.

Develop locally with Ollama. Deploy to production with vLLM. Wrap Bedrock or Vertex without lock-in. Same API surface, different backend.
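In practice the dev-to-prod swap can reduce to changing one URL. The sketch below uses a hypothetical LLAMA_STACK_URL environment variable (not an official setting) to point the same client code at a local Ollama-backed stack or a production vLLM-backed one:

```python
import os

# Hypothetical convention: dev machines use the local default, while
# deployments set LLAMA_STACK_URL to their production stack endpoint.
def stack_base_url() -> str:
    return os.environ.get("LLAMA_STACK_URL", "http://localhost:8321/v1")

print(stack_base_url())
```

Because the API surface is identical on both ends, no request-building code changes when the variable does.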

All providers

How it works

Your application talks to one server. That server routes to pluggable providers for inference, vector storage, files, safety, and tools. The composition happens at the server level, not in your application code.

[Diagram: Llama Stack architecture]

Open source

Apache 2.0 licensed. Contributions welcome.