
Dell Distribution of Llama Stack


The llamastack/distribution-dell distribution consists of the following provider configurations.

API            Provider(s)
agents         inline::meta-reference
datasetio      remote::huggingface, inline::localfs
eval           inline::meta-reference
inference      remote::tgi, inline::sentence-transformers
safety         inline::llama-guard
scoring        inline::basic, inline::llm-as-judge, inline::braintrust
telemetry      inline::meta-reference
tool_runtime   remote::brave-search, remote::tavily-search, inline::rag-runtime
vector_io      inline::faiss, remote::chromadb, remote::pgvector

You can use this distribution if you have GPUs and want to run inference against an independent TGI or Dell Enterprise Hub container.

Environment Variables

The following environment variables can be configured:

  • DEH_URL: URL for the Dell inference server (default: http://0.0.0.0:8181)
  • DEH_SAFETY_URL: URL for the Dell safety inference server (default: http://0.0.0.0:8282)
  • CHROMA_URL: URL for the Chroma server (default: http://localhost:6601)
  • INFERENCE_MODEL: Inference model loaded into the TGI server (default: meta-llama/Llama-3.2-3B-Instruct)
  • SAFETY_MODEL: Name of the safety (Llama-Guard) model to use (default: meta-llama/Llama-Guard-3-1B)

Setting up the inference server using Dell Enterprise Hub's custom TGI container

NOTE: This is a placeholder setup that runs inference with the stock TGI container. It will be updated to use Dell Enterprise Hub's containers once they are verified.

export INFERENCE_PORT=8181
export DEH_URL=http://0.0.0.0:$INFERENCE_PORT
export INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
export CHROMADB_HOST=localhost
export CHROMADB_PORT=6601
export CHROMA_URL=http://$CHROMADB_HOST:$CHROMADB_PORT
export CUDA_VISIBLE_DEVICES=0
export LLAMA_STACK_PORT=8321

docker run --rm -it \
  --pull always \
  --network host \
  -v $HOME/.cache/huggingface:/data \
  -e HF_TOKEN=$HF_TOKEN \
  -p $INFERENCE_PORT:$INFERENCE_PORT \
  --gpus "device=$CUDA_VISIBLE_DEVICES" \
  ghcr.io/huggingface/text-generation-inference \
  --dtype bfloat16 \
  --usage-stats off \
  --sharded false \
  --cuda-memory-fraction 0.7 \
  --model-id $INFERENCE_MODEL \
  --port $INFERENCE_PORT --hostname 0.0.0.0
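
Before wiring TGI into Llama Stack, it is worth confirming the server is up. A minimal check, assuming TGI's standard /health and /generate endpoints on the port chosen above:

# Wait until the server reports healthy
curl -sf http://localhost:$INFERENCE_PORT/health && echo "TGI inference server is up"

# Send a small test generation request
curl -s -X POST http://localhost:$INFERENCE_PORT/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hello, my name is", "parameters": {"max_new_tokens": 16}}'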

If you are using the Llama Stack Safety / Shield APIs, you will also need to run a second TGI instance serving a corresponding safety model such as meta-llama/Llama-Guard-3-1B, using a script like:

export SAFETY_INFERENCE_PORT=8282
export DEH_SAFETY_URL=http://0.0.0.0:$SAFETY_INFERENCE_PORT
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export CUDA_VISIBLE_DEVICES=1

docker run --rm -it \
  --pull always \
  --network host \
  -v $HOME/.cache/huggingface:/data \
  -e HF_TOKEN=$HF_TOKEN \
  -p $SAFETY_INFERENCE_PORT:$SAFETY_INFERENCE_PORT \
  --gpus "device=$CUDA_VISIBLE_DEVICES" \
  ghcr.io/huggingface/text-generation-inference \
  --dtype bfloat16 \
  --usage-stats off \
  --sharded false \
  --cuda-memory-fraction 0.7 \
  --model-id $SAFETY_MODEL \
  --hostname 0.0.0.0 \
  --port $SAFETY_INFERENCE_PORT
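
The safety instance can be checked the same way, assuming the same TGI /health endpoint on the safety port:

curl -sf http://localhost:$SAFETY_INFERENCE_PORT/health && echo "TGI safety server is up"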

The Dell distribution relies on ChromaDB for vector database usage

You can start a ChromaDB server easily using Docker.

# This is where the indices are persisted
mkdir -p $HOME/chromadb

docker run --rm -it \
  --network host \
  --name chromadb \
  -v $HOME/chromadb:/chroma/chroma \
  -e IS_PERSISTENT=TRUE \
  chromadb/chroma:latest \
  --port $CHROMADB_PORT \
  --host $CHROMADB_HOST
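
To confirm Chroma is reachable at the URL Llama Stack will use, you can hit its heartbeat endpoint. A quick sketch; note that the path is /api/v1/heartbeat on older Chroma releases and /api/v2/heartbeat on newer ones:

# Should return a JSON payload containing a nanosecond heartbeat value
curl -s http://$CHROMADB_HOST:$CHROMADB_PORT/api/v1/heartbeat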

Running Llama Stack

Now you are ready to run Llama Stack with TGI as the inference provider. You can do this either in a Python virtual environment (venv) or via Docker, which has a pre-built image.

Via Docker

This method allows you to get started quickly without having to build the distribution code.

# NOTE: mount the llama-stack / llama-models source trees only if you are testing
# local changes; otherwise the two source mounts below can be dropped.
# Use localhost/distribution-dell:dev instead of llamastack/distribution-dell if you
# are building / testing the image locally.
docker run -it \
  --pull always \
  --network host \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v $HOME/.llama:/root/.llama \
  -v $HOME/git/llama-stack:/app/llama-stack-source \
  -v $HOME/git/llama-models:/app/llama-models-source \
  llamastack/distribution-dell \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env CHROMA_URL=$CHROMA_URL
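
Once the container is running, you can sanity-check the stack from another terminal. The sketch below assumes the server exposes its standard /v1/health and /v1/models routes on $LLAMA_STACK_PORT:

# Check that the server is healthy
curl -s http://localhost:$LLAMA_STACK_PORT/v1/health

# List the models registered with the stack; the list should include $INFERENCE_MODEL
curl -s http://localhost:$LLAMA_STACK_PORT/v1/models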

If you are using Llama Stack Safety / Shield APIs, use:

# You need a local checkout of llama-stack to run this, get it using
# git clone https://github.com/meta-llama/llama-stack.git
cd /path/to/llama-stack

export SAFETY_INFERENCE_PORT=8282
export DEH_SAFETY_URL=http://0.0.0.0:$SAFETY_INFERENCE_PORT
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B

docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v $HOME/.llama:/root/.llama \
  -v ./llama_stack/distributions/dell/run-with-safety.yaml:/root/my-run.yaml \
  llamastack/distribution-dell \
  --config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env DEH_SAFETY_URL=$DEH_SAFETY_URL \
  --env CHROMA_URL=$CHROMA_URL
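
With the safety configuration running, you can additionally confirm that the Llama Guard shield was registered, assuming the stack's /v1/shields listing route:

curl -s http://localhost:$LLAMA_STACK_PORT/v1/shields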

Via venv

Make sure you have run pip install llama-stack and that the llama CLI is available.

llama stack build --distro dell --image-type venv
llama stack run dell \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env CHROMA_URL=$CHROMA_URL

If you are using Llama Stack Safety / Shield APIs, use:

llama stack run ./run-with-safety.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env DEH_SAFETY_URL=$DEH_SAFETY_URL \
  --env CHROMA_URL=$CHROMA_URL
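
However you start the stack, a quick end-to-end smoke test can be run with the llama-stack-client CLI. This is a sketch assuming you have installed llama-stack-client and that your client version supports these subcommands (names can vary slightly between releases):

pip install llama-stack-client

# Point the client at the running stack
llama-stack-client configure --endpoint http://localhost:$LLAMA_STACK_PORT

# List models and run a test chat completion against $INFERENCE_MODEL
llama-stack-client models list
llama-stack-client inference chat-completion --message "Hello, which model are you?"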