Contributing to Llama Stack

We want to make contributing to this project as easy and transparent as possible.

Set up your development environment

We use uv to manage Python dependencies and virtual environments. You can install uv by following the uv installation guide.

You can install the dependencies by running:

cd llama-stack
uv sync --group dev
uv pip install -e .
source .venv/bin/activate

Note

You can use a specific version of Python with uv by adding the --python <version> flag (e.g. --python 3.12). Otherwise, uv will automatically select a Python version according to the requires-python section of the pyproject.toml. For more info, see the uv docs around Python versions.
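For example, to create the environment with Python 3.12 explicitly:

uv sync --group dev --python 3.12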

Note that you can create a dotenv file .env that includes necessary environment variables:

LLAMA_STACK_BASE_URL=http://localhost:8321
LLAMA_STACK_CLIENT_LOG=debug
LLAMA_STACK_PORT=8321
LLAMA_STACK_CONFIG=<provider-name>
TAVILY_SEARCH_API_KEY=
BRAVE_SEARCH_API_KEY=

And then use this dotenv file when running client SDK tests via the following:

uv run --env-file .env -- pytest -v tests/integration/inference/test_text_inference.py --text-model=meta-llama/Llama-3.1-8B-Instruct

Pre-commit Hooks

We use pre-commit to run linting and formatting checks on your code. You can install the pre-commit hooks by running:

uv run pre-commit install

After that, pre-commit hooks will run automatically before each commit.

Alternatively, if you don’t want to install the pre-commit hooks, you can run the checks manually by running:

uv run pre-commit run --all-files

Caution

Before pushing your changes, make sure that the pre-commit hooks have passed successfully.

Discussions -> Issues -> Pull Requests

We actively welcome your pull requests. However, please read the following. This is heavily inspired by Ghostty.

If in doubt, please open a discussion; we can always convert that to an issue later.

Issues

We use GitHub issues to track public bugs. Please ensure your description is clear and has sufficient instructions to be able to reproduce the issue.

Meta has a bounty program for the safe disclosure of security bugs. In those cases, please go through the process outlined on that page and do not file a public issue.

Contributor License Agreement (“CLA”)

In order to accept your pull request, we need you to submit a CLA. You only need to do this once to work on any of Meta’s open source projects.

Complete your CLA here: https://code.facebook.com/cla

I’d like to contribute!

If you are new to the project, start by looking at the issues tagged with “good first issue”. If you’re interested, leave a comment on the issue and a triager will assign it to you.

Please avoid picking up too many issues at once. This helps you stay focused and ensures that others in the community also have opportunities to contribute.

  • Try to work on only 1–2 issues at a time, especially if you’re still getting familiar with the codebase.

  • Before taking an issue, check if it’s already assigned or being actively discussed.

  • If you’re blocked or can’t continue with an issue, feel free to unassign yourself or leave a comment so others can step in.

I have a bug!

  1. Search the issue tracker and discussions for similar issues.

  2. If you don’t have steps to reproduce, open a discussion.

  3. If you have steps to reproduce, open an issue.

I have an idea for a feature!

  1. Open a discussion.

I’ve implemented a feature!

  1. If there is an issue for the feature, open a pull request.

  2. If there is no issue, open a discussion and link to your branch.

I have a question!

  1. Open a discussion or use Discord.

Opening a Pull Request

  1. Fork the repo and create your branch from main.

  2. If you’ve changed APIs, update the documentation.

  3. Ensure the test suite passes.

  4. Make sure your code lints using pre-commit.

  5. If you haven’t already, complete the Contributor License Agreement (“CLA”).

  6. Ensure your pull request follows the conventional commits format (see the examples after this list).

  7. Ensure your pull request follows the coding style.
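For reference, conventional commits use a type: description title pattern; the examples below are purely illustrative:

feat: add streaming support to the inference API
fix: handle empty tool call responses in the agents provider
docs: clarify re-recording requirements for integration tests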

Please keep pull requests (PRs) small and focused. If you have a large set of changes, consider splitting them into logically grouped, smaller PRs to facilitate review and testing.

Tip

As a general guideline:

  • Experienced contributors should try to keep no more than 5 open PRs at a time.

  • New contributors are encouraged to have only one open PR at a time until they’re familiar with the codebase and process.

Repository guidelines

Coding Style

  • Comments should provide meaningful insights into the code. Avoid filler comments that simply describe the next step, as they create unnecessary clutter; the same goes for docstrings.

  • Prefer comments to clarify surprising behavior and/or relationships between parts of the code rather than explain what the next line of code does.

  • When catching exceptions, prefer a specific exception type rather than a broad catch-all like Exception.

  • Error messages should be prefixed with “Failed to …”

  • Use 4 spaces for indentation rather than tabs.

  • When using # noqa to suppress a style or linter warning, include a comment explaining the justification for bypassing the check.

  • When using # type: ignore to suppress a mypy warning, include a comment explaining the justification for bypassing the check.

  • Don’t use unicode characters in the codebase. ASCII-only is preferred for compatibility and readability reasons.

  • Provider configuration classes should be Pydantic models whose fields use Pydantic Field with a description documenting each setting. These descriptions are used to generate the provider documentation (see the sketch after this list).

  • When possible, use keyword arguments (rather than positional arguments) when calling functions.

  • Llama Stack provides custom Exception classes for certain resources; use them where applicable.
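Below is a minimal, hypothetical sketch that pulls several of these conventions together; the provider name, config fields, and file path are invented for illustration and do not correspond to a real provider:

from pydantic import BaseModel, Field


class ExampleProviderConfig(BaseModel):
    """Configuration for a hypothetical example inference provider."""

    # Every field carries a description; provider docs are generated from these.
    url: str = Field(description="Base URL of the example inference endpoint.")
    timeout: int = Field(default=30, description="Request timeout in seconds.")


def load_config(path: str) -> ExampleProviderConfig:
    try:
        with open(path) as f:
            raw = f.read()
    except OSError as e:  # catch a specific exception rather than a bare Exception
        raise RuntimeError(f"Failed to load provider config from {path}") from e
    return ExampleProviderConfig.model_validate_json(raw)


if __name__ == "__main__":
    # Prefer keyword arguments at call sites.
    config = load_config(path="example_config.json")
    print(config.url)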

License

By contributing to Llama, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree.

Common Tasks

Here are some tips for common tasks you may work on while contributing to Llama Stack:

Using llama stack build

Building a stack image will use the production version of the llama-stack and llama-stack-client packages. If you are developing with a llama-stack repository checked out and need your code to be reflected in the stack image, set LLAMA_STACK_DIR and LLAMA_STACK_CLIENT_DIR to the appropriate checked out directories when running any of the llama CLI commands.

Example:

cd work/
git clone https://github.com/meta-llama/llama-stack.git
git clone https://github.com/meta-llama/llama-stack-client-python.git
cd llama-stack
LLAMA_STACK_DIR=$(pwd) LLAMA_STACK_CLIENT_DIR=../llama-stack-client-python llama stack build --distro <...>

Updating distribution configurations

If you have made changes to a provider’s configuration in any form (introducing a new config key, or changing models, etc.), you should run ./scripts/distro_codegen.py to re-generate various YAML files as well as the documentation. You should not change docs/source/.../distributions/ files manually as they are auto-generated.

Updating the provider documentation

If you have made changes to a provider’s configuration, you should run ./scripts/provider_codegen.py to re-generate the documentation. You should not change docs/source/.../providers/ files manually as they are auto-generated. Note that the provider “description” field will be used to generate the provider documentation.

Building the Documentation

If you are making changes to the documentation at https://llama-stack.readthedocs.io/en/latest/, you can use the following command to build the documentation and preview your changes. You will need Sphinx and the readthedocs theme.

# This rebuilds the documentation pages.
uv run --group docs make -C docs/ html

# This will start a local server (usually at http://127.0.0.1:8000) that automatically rebuilds and refreshes when you make changes to the documentation.
uv run --group docs sphinx-autobuild docs/source docs/build/html --write-all

Update API Documentation

If you modify or add new API endpoints, update the API documentation accordingly. You can do this by running the following command:

uv run ./docs/openapi_generator/run_openapi_generator.sh

The generated API documentation will be available in docs/_static/. Make sure to review the changes before committing.

Adding a New Provider

See the dedicated provider development guides in the documentation.

Testing

There are two obvious types of tests:

Type        | Location           | Purpose
----------- | ------------------ | ----------------------------------------
Unit        | tests/unit/        | Fast, isolated component testing
Integration | tests/integration/ | End-to-end workflows with record-replay

Both have their place. For unit tests, it is important to create minimal mocks and instead rely more on “fakes”. Mocks are too brittle. In either case, tests must be very fast and reliable.

Record-replay for integration tests

Testing AI applications end-to-end creates some challenges:

  • API costs accumulate quickly during development and CI

  • Non-deterministic responses make tests unreliable

  • Multiple providers require testing the same logic across different APIs

Our solution: Record real API responses once, replay them for fast, deterministic tests. This is better than mocking because AI APIs have complex response structures and streaming behavior, and mocks can miss edge cases that real APIs exhibit. A single test can also exercise the underlying APIs in multiple complex ways, making it very hard to mock.

This gives you:

  • Cost control - No repeated API calls during development

  • Speed - Instant test execution with cached responses

  • Reliability - Consistent results regardless of external service state

  • Provider coverage - Same tests work across OpenAI, Anthropic, local models, etc.

Testing Quick Start

You can run the unit tests with:

uv run --group unit pytest -sv tests/unit/
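If you want to narrow the run while iterating, pytest's standard selection flags work as usual (the filter expression below is just an example):

uv run --group unit pytest -sv tests/unit/ -k "inference"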

For running integration tests, you must provide a few things:

  • A stack config. This is a pointer to a stack. You have a few ways to point to a stack:

    • server:<config> - automatically start a server with the given config (e.g., server:starter). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running.

    • server:<config>:<port> - same as above but with a custom port (e.g., server:starter:8322)

    • a URL which points to a Llama Stack distribution server

    • a distribution name (e.g., starter) or a path to a run.yaml file

    • a comma-separated list of api=provider pairs, e.g. inference=fireworks,safety=llama-guard,agents=meta-reference. This is most useful for testing a single API surface.

  • Any API keys you need should be set in the environment, or can be passed in with the --env option.

You can run the integration tests in replay mode with:

# Run all tests with existing recordings
uv run --group test \
  pytest -sv tests/integration/ --stack-config=starter
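The other --stack-config forms listed above work the same way. For example (the port, provider, and API key variable below are illustrative):

# Auto-start a server from the starter distribution on a custom port
uv run --group test \
  pytest -sv tests/integration/ --stack-config=server:starter:8322

# Exercise a single API surface against a specific provider
FIREWORKS_API_KEY=<key> uv run --group test \
  pytest -sv tests/integration/inference/ --stack-config=inference=fireworks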

Re-recording tests

Local Re-recording (Manual Setup Required)

If you want to re-record tests locally, you can do so with:

LLAMA_STACK_TEST_INFERENCE_MODE=record \
  uv run --group test \
  pytest -sv tests/integration/ --stack-config=starter -k "<appropriate test name>"

This will record new API responses and overwrite the existing recordings.

Warning

You must be careful when re-recording. CI workflows assume a specific setup for running the replay-mode tests, so you must re-record the tests in the same way as the CI workflows. This means:

  • you need Ollama running and serving some specific models.

  • you are using the starter distribution.

Next Steps

Advanced Topics

For developers who need a deeper understanding of the testing system internals, see the advanced testing documentation.

Benchmarking

Llama Stack Benchmark Suite on Kubernetes

Motivation

Performance benchmarking is critical for understanding the overhead and characteristics of the Llama Stack abstraction layer compared to direct inference engines like vLLM.

Why This Benchmark Suite Exists

Performance Validation: The Llama Stack provides a unified API layer across multiple inference providers, but this abstraction introduces potential overhead. This benchmark suite quantifies the performance impact by comparing:

  • Llama Stack inference (with vLLM backend)

  • Direct vLLM inference calls

  • Both under identical Kubernetes deployment conditions

Production Readiness Assessment: Real-world deployments require understanding performance characteristics under load. This suite simulates concurrent user scenarios with configurable parameters (duration, concurrency, request patterns) to validate production readiness.

Regression Detection (TODO): As the Llama Stack evolves, this benchmark provides automated regression detection for performance changes. CI/CD pipelines can leverage these benchmarks to catch performance degradations before production deployments.

Resource Planning: By measuring throughput, latency percentiles, and resource utilization patterns, teams can make informed decisions about:

  • Kubernetes resource allocation (CPU, memory, GPU)

  • Auto-scaling configurations

  • Cost optimization strategies

Key Metrics Captured

The benchmark suite measures critical performance indicators:

  • Throughput: Requests per second under sustained load

  • Latency Distribution: P50, P95, P99 response times

  • Time to First Token (TTFT): Critical for streaming applications

  • Error Rates: Request failures and timeout analysis

This data enables data-driven architectural decisions and performance optimization efforts.

Setup

1. Deploy base k8s infrastructure:

cd ../k8s
./apply.sh

2. Deploy benchmark components:

cd ../k8s-benchmark
./apply.sh

3. Verify deployment:

kubectl get pods
# Should see: llama-stack-benchmark-server, vllm-server, etc.

Quick Start

Basic Benchmarks

Benchmark Llama Stack (default):

cd docs/source/distributions/k8s-benchmark/
./run-benchmark.sh

Benchmark vLLM direct:

./run-benchmark.sh --target vllm

Custom Configuration

Extended benchmark with high concurrency:

./run-benchmark.sh --target vllm --duration 120 --concurrent 20

Short test run:

./run-benchmark.sh --target stack --duration 30 --concurrent 5

Command Reference

run-benchmark.sh Options

./run-benchmark.sh [options]

Options:
  -t, --target <stack|vllm>     Target to benchmark (default: stack)
  -d, --duration <seconds>      Duration in seconds (default: 60)
  -c, --concurrent <users>      Number of concurrent users (default: 10)
  -h, --help                    Show help message

Examples:
  ./run-benchmark.sh --target vllm              # Benchmark vLLM direct
  ./run-benchmark.sh --target stack             # Benchmark Llama Stack
  ./run-benchmark.sh -t vllm -d 120 -c 20       # vLLM with 120s, 20 users

Local Testing

Running Benchmark Locally

For local development without Kubernetes:

1. Start OpenAI mock server:

uv run python openai-mock-server.py --port 8080

2. Run benchmark against mock server:

uv run python benchmark.py \
  --base-url http://localhost:8080/v1 \
  --model mock-inference \
  --duration 30 \
  --concurrent 5

3. Test against local vLLM server:

# If you have vLLM running locally on port 8000
uv run python benchmark.py \
  --base-url http://localhost:8000/v1 \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --duration 30 \
  --concurrent 5

4. Profile the running server:

./profile_running_server.sh

OpenAI Mock Server

The openai-mock-server.py provides:

  • OpenAI-compatible API for testing without real models

  • Configurable streaming delay via the STREAM_DELAY_SECONDS env var (see the example below)

  • Consistent responses for reproducible benchmarks

  • Lightweight testing without GPU requirements

Mock server usage:

uv run python openai-mock-server.py --port 8080
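To exercise the configurable streaming delay, set STREAM_DELAY_SECONDS when starting the mock server (the value here is arbitrary):

STREAM_DELAY_SECONDS=0.2 uv run python openai-mock-server.py --port 8080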

The mock server is also deployed in k8s as openai-mock-service:8080 and can be used by changing the Llama Stack configuration to use the mock-vllm-inference provider.

Files in this Directory

  • benchmark.py - Core benchmark script with async streaming support

  • run-benchmark.sh - Main script with target selection and configuration

  • openai-mock-server.py - Mock OpenAI API server for local testing

  • README.md - This documentation file