Building Custom Distributions

This guide walks you through inspecting existing distributions, customising their configuration, and building runnable artefacts for your own deployment.

Explore existing distributions

All first-party distributions live under llama_stack/distributions/. Each directory contains:

  • build.yaml – the distribution specification (providers, additional dependencies, optional external provider directories).
  • config.yaml – sample run configuration (when provided).
  • Documentation fragments that power this site.

Browse that folder to understand available providers and copy a distribution to use as a starting point. When creating a new stack, duplicate an existing directory, rename it, and adjust the build.yaml file to match your requirements.
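The copy-and-rename step can be sketched as follows. This is an illustrative example with hypothetical names ("my-stack"); a scratch copy of the directory layout is created here so the snippet is self-contained, but in a real checkout you would operate on llama_stack/distributions/ directly.

```shell
# Scratch workspace standing in for a repository checkout.
work=$(mktemp -d) && cd "$work"
mkdir -p llama_stack/distributions/starter
printf 'name: starter\n' > llama_stack/distributions/starter/build.yaml  # stand-in for the real build.yaml

# Duplicate the distribution directory and rename it in build.yaml.
cp -r llama_stack/distributions/starter llama_stack/distributions/my-stack
sed -i 's/starter/my-stack/' llama_stack/distributions/my-stack/build.yaml
cat llama_stack/distributions/my-stack/build.yaml
```

From there, edit the new build.yaml to add or remove providers as needed.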

Build the container image

Use the Containerfile at containers/Containerfile. It installs llama-stack, resolves distribution dependencies via llama stack list-deps, and sets the entrypoint to llama stack run.

Single-architecture build:

docker build . \
  -f containers/Containerfile \
  --build-arg DISTRO_NAME=starter \
  --tag llama-stack:starter

Multi-architecture build:

The Containerfile supports multi-architecture builds for linux/amd64 and linux/arm64. To build and push images with a multi-arch image index, you can use either Docker buildx or Podman buildx:

# Build and push multi-arch image index (creates manifest list)
docker buildx build --platform linux/amd64,linux/arm64 \
  --push \
  -f containers/Containerfile \
  --build-arg DISTRO_NAME=starter \
  --tag docker.io/llamastack/distribution-starter:latest .

To add more architectures in the future, extend the --platform flag:

docker buildx build --platform linux/amd64,linux/arm64,linux/s390x,linux/ppc64le \
  --push \
  -f containers/Containerfile \
  --build-arg DISTRO_NAME=starter \
  --tag docker.io/llamastack/distribution-starter:latest .

Handy build arguments:

  • DISTRO_NAME – distribution directory name (defaults to starter).
  • RUN_CONFIG_PATH – absolute path inside the build context for a run config that should be baked into the image (e.g. /workspace/config.yaml).
  • INSTALL_MODE=editable – install the repository copied into /workspace with uv pip install -e. Pair it with --build-arg LLAMA_STACK_DIR=/workspace.
  • LLAMA_STACK_CLIENT_DIR – optional editable install of the Python client.
  • PYPI_VERSION / TEST_PYPI_VERSION – pin specific releases when not using editable installs.
  • KEEP_WORKSPACE=1 – retain /workspace in the final image if you need to access additional files (such as sample configs or provider bundles).
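Several of these arguments are typically combined. The sketch below assembles such a command and prints it for review rather than executing it (remove the echo to actually build); the tag name is illustrative.

```shell
# Hedged sketch: an editable-install build that also bakes in a run config.
# DISTRO_NAME, INSTALL_MODE, LLAMA_STACK_DIR, and RUN_CONFIG_PATH are the
# build arguments described in the list above.
DISTRO_NAME=starter
echo docker build . \
  -f containers/Containerfile \
  --build-arg DISTRO_NAME="$DISTRO_NAME" \
  --build-arg INSTALL_MODE=editable \
  --build-arg LLAMA_STACK_DIR=/workspace \
  --build-arg RUN_CONFIG_PATH=/workspace/config.yaml \
  --tag "llama-stack:${DISTRO_NAME}-editable"
```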

Make sure any custom build.yaml, run configs, or provider directories you reference are included in the Docker build context so the Containerfile can read them.

Air-gapped and disconnected deployments

If you are building images for air-gapped or network-restricted clusters (e.g. disconnected OpenShift / Kubernetes environments), you must pre-cache the tiktoken cl100k_base encoding during the image build. Without it, the first call to vector_stores.files.create() will attempt a runtime HTTP download from openaipublic.blob.core.windows.net, which will fail when outbound internet access is unavailable.

The container Dockerfile/Containerfile already includes this step. When building a custom container image, add the following after all Python packages have been installed:

# Pre-cache tiktoken encoding for air-gapped deployments.
# Must come after llama-stack (and tiktoken) are installed.
ENV TIKTOKEN_CACHE_DIR="/.cache/tiktoken"
RUN python3 -c "import tiktoken; tiktoken.get_encoding('cl100k_base')"

Ensure /.cache/tiktoken has appropriate permissions for the runtime user, or adjust TIKTOKEN_CACHE_DIR to a path your container user can read. Apply the chmod after the cache has been populated so the cached files themselves pick up the permissions:

ENV TIKTOKEN_CACHE_DIR="/.cache/tiktoken"
RUN mkdir -p /.cache/tiktoken \
  && python3 -c "import tiktoken; tiktoken.get_encoding('cl100k_base')" \
  && chmod -R g+rw /.cache/tiktoken

Note: TIKTOKEN_CACHE_DIR must be set as a persistent ENV (not just a build-time ARG) so the server process can find the cached encoding files at runtime.

If you need to add additional encodings (e.g. for a custom chunking strategy), pre-cache each one in the same RUN step:

RUN python3 -c "import tiktoken; tiktoken.get_encoding('cl100k_base'); tiktoken.get_encoding('p50k_base')"

Run your stack server

After building the image, launch it directly with Docker or Podman; the entrypoint calls llama stack run using the baked distribution or the bundled run config:

docker run -d \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  -e INFERENCE_MODEL=$INFERENCE_MODEL \
  -e OLLAMA_URL=http://host.docker.internal:11434 \
  llama-stack:starter \
  --port $LLAMA_STACK_PORT

Here are the docker flags and their uses:

  • -d: Runs the container in detached mode as a background process

  • -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT: Maps the container port to the host port for accessing the server

  • -v ~/.llama:/root/.llama: Mounts the local .llama directory to persist configurations and data

  • -e INFERENCE_MODEL=$INFERENCE_MODEL: Sets the INFERENCE_MODEL environment variable in the container

  • -e OLLAMA_URL=http://host.docker.internal:11434: Sets the OLLAMA_URL environment variable in the container

  • llama-stack:starter: The name and tag of the container image to run

  • --port $LLAMA_STACK_PORT: Port number for the server to listen on
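The run command above assumes LLAMA_STACK_PORT and INFERENCE_MODEL are set in your shell. The values below are illustrative (8321 is the Llama Stack server's default port; the model identifier is an example, not a requirement):

```shell
# Example environment for the docker run command above (illustrative values).
export LLAMA_STACK_PORT=8321
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
echo "mapping host port $LLAMA_STACK_PORT to container port $LLAMA_STACK_PORT"
```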

If you prepared a custom run config, mount it into the container and reference it explicitly:

docker run \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v $(pwd)/config.yaml:/app/config.yaml \
  llama-stack:starter \
  /app/config.yaml