# Meta Reference GPU Distribution
The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations:
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `inline::meta-reference` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
Note that you need access to NVIDIA GPUs to run this distribution. It is not compatible with CPU-only machines or machines with AMD GPUs.
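Before proceeding, you can sanity-check that the GPUs are visible both on the host and from within Docker. A minimal check, assuming `nvidia-smi` and the NVIDIA Container Toolkit are installed (the CUDA image tag below is only an example):

```bash
# Verify the host can see the GPUs
nvidia-smi

# Verify Docker can pass GPUs through to containers
# (requires the NVIDIA Container Toolkit; image tag is an example)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```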
## Environment Variables
The following environment variables can be configured:
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
- `INFERENCE_MODEL`: Inference model loaded into the Meta Reference server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `INFERENCE_CHECKPOINT_DIR`: Directory containing the Meta Reference model checkpoint (default: `null`)
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
- `SAFETY_CHECKPOINT_DIR`: Directory containing the Llama-Guard model checkpoint (default: `null`)
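For example, to override the defaults in your shell before launching the server (the values shown are just the documented defaults; substitute your own):

```bash
# Example overrides; adjust model IDs to match checkpoints you have downloaded
export LLAMA_STACK_PORT=8321
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```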
## Prerequisite: Downloading Models
Please use `llama model list --downloaded` to check that you have Llama model checkpoints downloaded in `~/.llama` before proceeding. See the installation guide for details on downloading models: run `llama model list` to see the models available for download, and `llama model download` to download the checkpoints.
```
$ llama model list --downloaded
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ Model                                   ┃ Size     ┃ Modified Time       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ Llama3.2-1B-Instruct:int4-qlora-eo8     │ 1.53 GB  │ 2025-02-26 11:22:28 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.2-1B                             │ 2.31 GB  │ 2025-02-18 21:48:52 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Prompt-Guard-86M                        │ 0.02 GB  │ 2025-02-26 11:29:28 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.2-3B-Instruct:int4-spinquant-eo8 │ 3.69 GB  │ 2025-02-26 11:37:41 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.2-3B                             │ 5.99 GB  │ 2025-02-18 21:51:26 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.1-8B                             │ 14.97 GB │ 2025-02-16 10:36:37 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.2-1B-Instruct:int4-spinquant-eo8 │ 1.51 GB  │ 2025-02-26 11:35:02 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama-Guard-3-1B                        │ 2.80 GB  │ 2025-02-26 11:20:46 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama-Guard-3-1B:int4                   │ 0.43 GB  │ 2025-02-26 11:33:33 │
└─────────────────────────────────────────┴──────────┴─────────────────────┘
```
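If a model you need is missing from the list, you can fetch it with the CLI. A sketch of the Meta-source flow (the CLI prompts for the signed download URL you receive from Meta; the model ID shown is only an example):

```bash
# Download a checkpoint from Meta; you will be prompted for the signed URL.
# Model ID is an example -- pick one from `llama model list`.
llama model download --source meta --model-id Llama3.2-3B-Instruct
```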
## Running the Distribution
You can run the distribution via Docker (which has a pre-built image) or via a local venv.
### Via Docker
This method allows you to get started quickly without having to build the distribution code.
```bash
LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  --gpus all \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  llamastack/distribution-meta-reference-gpu \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```
If you are using Llama Stack Safety / Shield APIs, use:
```bash
docker run \
  -it \
  --pull always \
  --gpus all \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  llamastack/distribution-meta-reference-gpu \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```
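Once the container is up, a quick way to confirm the server is responding (the health endpoint path is assumed from recent Llama Stack releases and may differ in your version):

```bash
# Smoke test: the server should return a healthy status once model
# loading finishes (endpoint path assumed from recent releases)
curl http://localhost:8321/v1/health
```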
### Via venv
Make sure you have run `uv pip install llama-stack` and that the Llama Stack CLI is available.
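If you are starting from scratch, a minimal setup might look like this (assuming `uv` is already installed):

```bash
# Create and activate a virtual environment, then install the CLI into it
uv venv .venv
source .venv/bin/activate
uv pip install llama-stack
```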
```bash
llama stack build --distro meta-reference-gpu --image-type venv
llama stack run distributions/meta-reference-gpu/run.yaml \
  --port 8321 \
  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```
If you are using Llama Stack Safety / Shield APIs, use:
```bash
llama stack run distributions/meta-reference-gpu/run-with-safety.yaml \
  --port 8321 \
  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```
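After the server starts, you can exercise the inference API directly. A sketch using curl (the endpoint path and request shape are assumed from recent Llama Stack releases and may differ in your version):

```bash
# Hypothetical smoke test: send a chat message to the loaded model.
# Endpoint and payload shape assumed from recent Llama Stack releases.
curl -s http://localhost:8321/v1/inference/chat-completion \
  -H "Content-Type: application/json" \
  -d '{
        "model_id": "meta-llama/Llama-3.2-3B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```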