Contributing to Llama Stack
We want to make contributing to this project as easy and transparent as possible.
Set up your development environment
We use uv to manage Python dependencies and virtual environments. You can install uv by following this guide.
You can install the dependencies by running:
cd llama-stack
uv sync --group dev
uv pip install -e .
source .venv/bin/activate
Note
You can use a specific version of Python with uv by adding the --python <version> flag (e.g. --python 3.12). Otherwise, uv will automatically select a Python version according to the requires-python section of the pyproject.toml.
For more info, see the uv docs around Python versions.
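For example, to set up the environment with Python 3.12 (assuming that interpreter is available on your machine), you could pass the flag to the sync step:
uv sync --python 3.12 --group dev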
Note that you can create a dotenv file .env that includes necessary environment variables:
LLAMA_STACK_BASE_URL=http://localhost:8321
LLAMA_STACK_CLIENT_LOG=debug
LLAMA_STACK_PORT=8321
LLAMA_STACK_CONFIG=<provider-name>
TAVILY_SEARCH_API_KEY=
BRAVE_SEARCH_API_KEY=
And then use this dotenv file when running client SDK tests via the following:
uv run --env-file .env -- pytest -v tests/integration/inference/test_text_inference.py --text-model=meta-llama/Llama-3.1-8B-Instruct
Pre-commit Hooks
We use pre-commit to run linting and formatting checks on your code. You can install the pre-commit hooks by running:
uv run pre-commit install
After that, pre-commit hooks will run automatically before each commit.
Alternatively, if you don’t want to install the pre-commit hooks, you can run the checks manually by running:
uv run pre-commit run --all-files
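You can also run a single hook by naming it. For instance, assuming a ruff hook is configured in this repository (the hook id here is illustrative), that would look like:
uv run pre-commit run ruff --all-files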
Caution
Before pushing your changes, make sure that the pre-commit hooks have passed successfully.
Discussions -> Issues -> Pull Requests
We actively welcome your pull requests. However, please read the following. This is heavily inspired by Ghostty.
If in doubt, please open a discussion; we can always convert that to an issue later.
Issues
We use GitHub issues to track public bugs. Please ensure your description is clear and has sufficient instructions to be able to reproduce the issue.
Meta has a bounty program for the safe disclosure of security bugs. In those cases, please go through the process outlined on that page and do not file a public issue.
Contributor License Agreement (“CLA”)
In order to accept your pull request, we need you to submit a CLA. You only need to do this once to work on any of Meta’s open source projects.
Complete your CLA here: https://code.facebook.com/cla
I’d like to contribute!
If you are new to the project, start by looking at the issues tagged with “good first issue”. If you’re interested, leave a comment on the issue and a triager will assign it to you.
Please avoid picking up too many issues at once. This helps you stay focused and ensures that others in the community also have opportunities to contribute.
Try to work on only 1–2 issues at a time, especially if you’re still getting familiar with the codebase.
Before taking an issue, check if it’s already assigned or being actively discussed.
If you’re blocked or can’t continue with an issue, feel free to unassign yourself or leave a comment so others can step in.
I have a bug!
Search the issue tracker and discussions for similar issues.
If you don’t have steps to reproduce, open a discussion.
If you have steps to reproduce, open an issue.
I have an idea for a feature!
Open a discussion.
I’ve implemented a feature!
If there is an issue for the feature, open a pull request.
If there is no issue, open a discussion and link to your branch.
I have a question!
Open a discussion or use Discord.
Opening a Pull Request
Fork the repo and create your branch from main.
If you’ve changed APIs, update the documentation.
Ensure the test suite passes.
Make sure your code lints using pre-commit.
If you haven’t already, complete the Contributor License Agreement (“CLA”).
Ensure your pull request follows the conventional commits format (see the example below).
Ensure your pull request follows the coding style.
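A PR title in the conventional commits format follows the pattern type(scope): description. For example (the scopes and descriptions here are made up for illustration):
feat(inference): add streaming support to chat completions
fix(docs): correct a broken link in the contributing guide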
Please keep pull requests (PRs) small and focused. If you have a large set of changes, consider splitting them into logically grouped, smaller PRs to facilitate review and testing.
Tip
As a general guideline:
Experienced contributors should try to keep no more than 5 open PRs at a time.
New contributors are encouraged to have only one open PR at a time until they’re familiar with the codebase and process.
Repository guidelines
Coding Style
Comments should provide meaningful insights into the code. Avoid filler comments that simply describe the next step, as they create unnecessary clutter; the same goes for docstrings.
Prefer comments that clarify surprising behavior and/or relationships between parts of the code rather than explain what the next line of code does.
When catching exceptions, prefer a specific exception type rather than a broad catch-all like Exception.
Error messages should be prefixed with “Failed to …”.
Use 4 spaces for indentation rather than tabs.
When using # noqa to suppress a style or linter warning, include a comment explaining the justification for bypassing the check.
When using # type: ignore to suppress a mypy warning, include a comment explaining the justification for bypassing the check.
Don’t use Unicode characters in the codebase. ASCII-only is preferred for compatibility and readability reasons.
Provider configuration classes should be Pydantic models, and each field should include a description that documents the setting; these descriptions are used to generate the provider documentation.
When possible, call functions with keyword arguments rather than positional arguments.
Llama Stack provides custom exception classes for certain resources; use them where applicable.
License
By contributing to Llama, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree.
Common Tasks
Some tips about common tasks you work on while contributing to Llama Stack:
Using llama stack build
Building a stack image will use the production version of the llama-stack and llama-stack-client packages. If you are developing with a llama-stack repository checked out and need your code to be reflected in the stack image, set LLAMA_STACK_DIR and LLAMA_STACK_CLIENT_DIR to the appropriate checked-out directories when running any of the llama CLI commands.
Example:
cd work/
git clone https://github.com/meta-llama/llama-stack.git
git clone https://github.com/meta-llama/llama-stack-client-python.git
cd llama-stack
LLAMA_STACK_DIR=$(pwd) LLAMA_STACK_CLIENT_DIR=../llama-stack-client-python llama stack build --distro <...>
Updating distribution configurations
If you have made changes to a provider’s configuration in any form (introducing a new config key, or changing models, etc.), you should run ./scripts/distro_codegen.py to re-generate various YAML files as well as the documentation. You should not change docs/source/.../distributions/ files manually, as they are auto-generated.
Updating the provider documentation
If you have made changes to a provider’s configuration, you should run ./scripts/provider_codegen.py to re-generate the documentation. You should not change docs/source/.../providers/ files manually, as they are auto-generated.
Note that the provider “description” field will be used to generate the provider documentation.
Building the Documentation
If you are making changes to the documentation at https://llama-stack.readthedocs.io/en/latest/, you can use the following commands to build the documentation and preview your changes. You will need Sphinx and the readthedocs theme.
# This rebuilds the documentation pages.
uv run --group docs make -C docs/ html
# This will start a local server (usually at http://127.0.0.1:8000) that automatically rebuilds and refreshes when you make changes to the documentation.
uv run --group docs sphinx-autobuild docs/source docs/build/html --write-all
Update API Documentation
If you modify or add new API endpoints, update the API documentation accordingly. You can do this by running the following command:
uv run ./docs/openapi_generator/run_openapi_generator.sh
The generated API documentation will be available in docs/_static/. Make sure to review the changes before committing.
Adding a New Provider
See:
Adding a New API Provider Page which describes how to add new API providers to the Stack.
Vector Database Page which describes how to add a new vector database with Llama Stack.
External Provider Page which describes how to add external providers to the Stack.
Testing
There are two obvious types of tests:
| Type | Location | Purpose |
|---|---|---|
| Unit | tests/unit/ | Fast, isolated component testing |
| Integration | tests/integration/ | End-to-end workflows with record-replay |
Both have their place. For unit tests, it is important to create minimal mocks and instead rely more on “fakes”. Mocks are too brittle. In either case, tests must be very fast and reliable.
Record-replay for integration tests
Testing AI applications end-to-end creates some challenges:
API costs accumulate quickly during development and CI
Non-deterministic responses make tests unreliable
Multiple providers require testing the same logic across different APIs
Our solution: Record real API responses once, replay them for fast, deterministic tests. This is better than mocking because AI APIs have complex response structures and streaming behavior. Mocks can miss edge cases that real APIs exhibit. A single test can exercise underlying APIs in multiple complex ways, making it really hard to mock.
This gives you:
Cost control - No repeated API calls during development
Speed - Instant test execution with cached responses
Reliability - Consistent results regardless of external service state
Provider coverage - Same tests work across OpenAI, Anthropic, local models, etc.
Testing Quick Start
You can run the unit tests with:
uv run --group unit pytest -sv tests/unit/
For running integration tests, you must provide a few things:
A stack config. This is a pointer to a stack. You have a few ways to point to a stack:
server:<config> - automatically start a server with the given config (e.g., server:starter). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running.
server:<config>:<port> - same as above but with a custom port (e.g., server:starter:8322)
a URL which points to a Llama Stack distribution server
a distribution name (e.g., starter) or a path to a run.yaml file
a comma-separated list of api=provider pairs, e.g. inference=fireworks,safety=llama-guard,agents=meta-reference. This is most useful for testing a single API surface.
Any API keys you need to use should be set in the environment, or can be passed in with the --env option.
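For example, the server form and the api=provider form from the list above could be exercised like this (illustrative invocations only; they assume the starter distribution and a Fireworks API key set in your environment):
# Auto-start a server from the starter config on a custom port
uv run --group test \
pytest -sv tests/integration/inference --stack-config=server:starter:8322
# Run against a single API surface using api=provider pairs
uv run --group test \
pytest -sv tests/integration/inference --stack-config=inference=fireworks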
You can run the integration tests in replay mode with:
# Run all tests with existing recordings
uv run --group test \
pytest -sv tests/integration/ --stack-config=starter
Re-recording tests
Local Re-recording (Manual Setup Required)
If you want to re-record tests locally, you can do so with:
LLAMA_STACK_TEST_INFERENCE_MODE=record \
uv run --group test \
pytest -sv tests/integration/ --stack-config=starter -k "<appropriate test name>"
This will record new API responses and overwrite the existing recordings.
Warning
You must be careful when re-recording. CI workflows assume a specific setup for running the replay-mode tests. You must re-record the tests in the same way as the CI workflows. This means:
you need Ollama running and serving some specific models.
you are using the starter distribution.
Remote Re-recording (Recommended)
For easier re-recording without local setup, use the automated recording workflow:
# Record tests for specific test subdirectories
./scripts/github/schedule-record-workflow.sh --test-subdirs "agents,inference"
# Record with vision tests enabled
./scripts/github/schedule-record-workflow.sh --test-suite vision
# Record with specific provider
./scripts/github/schedule-record-workflow.sh --test-subdirs "agents" --test-provider vllm
This script:
🚀 Runs in GitHub Actions - no local Ollama setup required
🔍 Auto-detects your branch and associated PR
🍴 Works from forks - handles repository context automatically
✅ Commits recordings back to your branch
Prerequisites:
GitHub CLI: brew install gh && gh auth login
jq: brew install jq
Your branch pushed to a remote
Supported providers: vllm, ollama
Next Steps
Integration Testing Guide - Detailed usage and configuration
Unit Testing Guide - Fast component testing
Advanced Topics
For developers who need deeper understanding of the testing system internals:
Benchmarking
Llama Stack Benchmark Suite on Kubernetes
Motivation
Performance benchmarking is critical for understanding the overhead and characteristics of the Llama Stack abstraction layer compared to direct inference engines like vLLM.
Why This Benchmark Suite Exists
Performance Validation: The Llama Stack provides a unified API layer across multiple inference providers, but this abstraction introduces potential overhead. This benchmark suite quantifies the performance impact by comparing:
Llama Stack inference (with vLLM backend)
Direct vLLM inference calls
Both under identical Kubernetes deployment conditions
Production Readiness Assessment: Real-world deployments require understanding performance characteristics under load. This suite simulates concurrent user scenarios with configurable parameters (duration, concurrency, request patterns) to validate production readiness.
Regression Detection (TODO): As the Llama Stack evolves, this benchmark provides automated regression detection for performance changes. CI/CD pipelines can leverage these benchmarks to catch performance degradations before production deployments.
Resource Planning: By measuring throughput, latency percentiles, and resource utilization patterns, teams can make informed decisions about:
Kubernetes resource allocation (CPU, memory, GPU)
Auto-scaling configurations
Cost optimization strategies
Key Metrics Captured
The benchmark suite measures critical performance indicators:
Throughput: Requests per second under sustained load
Latency Distribution: P50, P95, P99 response times
Time to First Token (TTFT): Critical for streaming applications
Error Rates: Request failures and timeout analysis
This data enables data-driven architectural decisions and performance optimization efforts.
Setup
1. Deploy base k8s infrastructure:
cd ../k8s
./apply.sh
2. Deploy benchmark components:
cd ../k8s-benchmark
./apply.sh
3. Verify deployment:
kubectl get pods
# Should see: llama-stack-benchmark-server, vllm-server, etc.
Quick Start
Basic Benchmarks
Benchmark Llama Stack (default):
cd docs/source/distributions/k8s-benchmark/
./run-benchmark.sh
Benchmark vLLM direct:
./run-benchmark.sh --target vllm
Custom Configuration
Extended benchmark with high concurrency:
./run-benchmark.sh --target vllm --duration 120 --concurrent 20
Short test run:
./run-benchmark.sh --target stack --duration 30 --concurrent 5
Command Reference
run-benchmark.sh Options
./run-benchmark.sh [options]
Options:
-t, --target <stack|vllm> Target to benchmark (default: stack)
-d, --duration <seconds> Duration in seconds (default: 60)
-c, --concurrent <users> Number of concurrent users (default: 10)
-h, --help Show help message
Examples:
./run-benchmark.sh --target vllm # Benchmark vLLM direct
./run-benchmark.sh --target stack # Benchmark Llama Stack
./run-benchmark.sh -t vllm -d 120 -c 20 # vLLM with 120s, 20 users
Local Testing
Running Benchmark Locally
For local development without Kubernetes:
1. Start OpenAI mock server:
uv run python openai-mock-server.py --port 8080
2. Run benchmark against mock server:
uv run python benchmark.py \
--base-url http://localhost:8080/v1 \
--model mock-inference \
--duration 30 \
--concurrent 5
3. Test against local vLLM server:
# If you have vLLM running locally on port 8000
uv run python benchmark.py \
--base-url http://localhost:8000/v1 \
--model meta-llama/Llama-3.2-3B-Instruct \
--duration 30 \
--concurrent 5
4. Profile the running server:
./profile_running_server.sh
OpenAI Mock Server
The openai-mock-server.py provides:
OpenAI-compatible API for testing without real models
Configurable streaming delay via the STREAM_DELAY_SECONDS env var
Consistent responses for reproducible benchmarks
Lightweight testing without GPU requirements
Mock server usage:
uv run python openai-mock-server.py --port 8080
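For example, to make streamed responses arrive more slowly (the 0.2-second value here is just an illustrative assumption), you could set the environment variable when starting the mock server:
# Add an artificial delay between streamed chunks
STREAM_DELAY_SECONDS=0.2 uv run python openai-mock-server.py --port 8080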
The mock server is also deployed in k8s as openai-mock-service:8080 and can be used by changing the Llama Stack configuration to use the mock-vllm-inference provider.
Files in this Directory
benchmark.py - Core benchmark script with async streaming support
run-benchmark.sh - Main script with target selection and configuration
openai-mock-server.py - Mock OpenAI API server for local testing
README.md - This documentation file