OpenAI-Compatible API Server

Build AI apps with
any model, anywhere

Drop-in replacement for the OpenAI API. Use any client, any framework, any model. Swap providers without changing code.

Try it now. No installation needed beyond uv:

uvx --from 'llama-stack[starter]' llama stack run starter
from openai import OpenAI

# Point any standard OpenAI client at the local Llama Stack server.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# Server-side agentic loop with built-in web search, via the Responses API.
response = client.responses.create(
    model="llama-3.3-70b",
    input="Summarize this repository",
    tools=[{"type": "web_search"}],
)

Works with

Ollama, vLLM, OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Gemini, Together AI, Fireworks, PGVector, Qdrant, ChromaDB, Milvus, Weaviate, and more

OpenAI-compatible endpoints

Use any OpenAI client library. Zero code changes.

/v1/chat/completions

Chat Completions

Standard OpenAI-compatible chat and completion endpoints

/v1/responses

Responses API

Server-side agentic orchestration with tool calling and MCP

/v1/embeddings

Embeddings

Text embeddings from any provider

/v1/vector_stores

Vector Stores

Managed document storage and semantic search

/v1/files

Files & Batches

File upload, processing, and batch operations

/v1/models

Models

Model discovery and management
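
Because every endpoint speaks the OpenAI wire format, any OpenAI client, or plain HTTP, covers all of them. A raw-HTTP sketch of two of the endpoints above; the model names are placeholders that depend on which providers you configure:

```python
import json
from urllib import request

BASE = "http://localhost:8321/v1"  # local Llama Stack server

def openai_request(path: str, payload: dict) -> request.Request:
    """Build a plain OpenAI-wire-format HTTP request (not sent yet)."""
    return request.Request(
        url=f"{BASE}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer fake"},
        method="POST",
    )

# Chat Completions: the standard OpenAI request body.
chat = openai_request("/chat/completions", {
    "model": "llama-3.3-70b",        # placeholder; depends on your providers
    "messages": [{"role": "user", "content": "Hello"}],
})

# Embeddings: same wire format as api.openai.com.
emb = openai_request("/embeddings", {
    "model": "all-MiniLM-L6-v2",     # placeholder embedding model
    "input": ["semantic search works anywhere"],
})

# With the server running: request.urlopen(chat) / request.urlopen(emb)
```

The same two requests are one-liners with `client.chat.completions.create(...)` and `client.embeddings.create(...)` in the OpenAI Python client.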

How it works

One API surface, pluggable providers, deploy anywhere

Llama Stack Architecture

Plug in any provider

Develop locally with Ollama, deploy to production with vLLM or a managed service

Inference

Ollama, vLLM, AWS Bedrock, Azure OpenAI, OpenAI, Anthropic, Gemini, and 15+ more

Vector Stores

PGVector, Qdrant, ChromaDB, Milvus, Weaviate, and 4+ more

Tools

MCP Servers, Web Search, File Search (RAG), PDF / Docling
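
All of these plug into the Responses API through the `tools` parameter: built-in tools are enabled by type, and an MCP server is attached by URL. A sketch of the payload, following the OpenAI Responses tool format; the vector store id, server label, and URL are hypothetical:

```python
tools = [
    {"type": "web_search"},                       # built-in web search
    {
        "type": "file_search",                    # RAG over a managed vector store
        "vector_store_ids": ["vs_docs"],          # hypothetical store id
    },
    {
        "type": "mcp",                            # attach a remote MCP server
        "server_label": "deepwiki",               # hypothetical label
        "server_url": "https://example.com/mcp",  # hypothetical URL
    },
]

# Passed straight through the same call as the snippet at the top:
# client.responses.create(model=..., input=..., tools=tools)
```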

Open source. Community driven.

Join thousands of developers building with Llama Stack