Version: v0.4.2

Dell Distribution of Llama Stack

The llamastack/distribution-dell distribution consists of the following provider configurations.

API	Provider(s)
agents	`inline::meta-reference`
datasetio	`remote::huggingface`, `inline::localfs`
eval	`inline::meta-reference`
inference	`remote::tgi`, `inline::sentence-transformers`
safety	`inline::llama-guard`
scoring	`inline::basic`, `inline::llm-as-judge`, `inline::braintrust`
tool_runtime	`remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`
vector_io	`inline::faiss`, `remote::chromadb`, `remote::pgvector`

Use this distribution if you have GPUs and want to run inference in a separate TGI or Dell Enterprise Hub container.

Environment Variables

The following environment variables can be configured:

DEH_URL: URL for the Dell inference server (default: http://0.0.0.0:8181)
DEH_SAFETY_URL: URL for the Dell safety inference server (default: http://0.0.0.0:8282)
CHROMA_URL: URL for the Chroma server (default: http://localhost:6601)
INFERENCE_MODEL: Inference model loaded into the TGI server (default: meta-llama/Llama-3.2-3B-Instruct)
SAFETY_MODEL: Name of the safety (Llama-Guard) model to use (default: meta-llama/Llama-Guard-3-1B)
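
One way to apply these defaults in a shell session is parameter expansion, which keeps any value you have already exported. The sketch below only mirrors the defaults listed above; the function name `set_dell_defaults` is hypothetical and not part of Llama Stack.

```shell
# Hypothetical helper: export each variable, keeping an existing value and
# falling back to the default documented above when it is unset or empty.
set_dell_defaults() {
  export DEH_URL="${DEH_URL:-http://0.0.0.0:8181}"
  export DEH_SAFETY_URL="${DEH_SAFETY_URL:-http://0.0.0.0:8282}"
  export CHROMA_URL="${CHROMA_URL:-http://localhost:6601}"
  export INFERENCE_MODEL="${INFERENCE_MODEL:-meta-llama/Llama-3.2-3B-Instruct}"
  export SAFETY_MODEL="${SAFETY_MODEL:-meta-llama/Llama-Guard-3-1B}"
}

set_dell_defaults
echo "inference: $INFERENCE_MODEL at $DEH_URL"
```

Because the expansion uses `:-`, an explicit `export INFERENCE_MODEL=...` made earlier in the session takes precedence over the default.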

Setting up Inference server using Dell Enterprise Hub's custom TGI container.

NOTE: This is a placeholder to run inference with TGI. This will be updated to use Dell Enterprise Hub's containers once verified.

export INFERENCE_PORT=8181
export DEH_URL=http://0.0.0.0:$INFERENCE_PORT
export INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
export CHROMADB_HOST=localhost
export CHROMADB_PORT=6601
export CHROMA_URL=http://$CHROMADB_HOST:$CHROMADB_PORT
export CUDA_VISIBLE_DEVICES=0
export LLAMA_STACK_PORT=8321

docker run --rm -it \
  --pull always \
  --network host \
  -v $HOME/.cache/huggingface:/data \
  -e HF_TOKEN=$HF_TOKEN \
  -p $INFERENCE_PORT:$INFERENCE_PORT \
  --gpus "device=$CUDA_VISIBLE_DEVICES" \
  ghcr.io/huggingface/text-generation-inference \
  --dtype bfloat16 \
  --usage-stats off \
  --sharded false \
  --cuda-memory-fraction 0.7 \
  --model-id $INFERENCE_MODEL \
  --port $INFERENCE_PORT --hostname 0.0.0.0
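
The TGI container can take a while to download and load the model, so it may help to wait until it responds before starting Llama Stack. The following is a minimal sketch, assuming `curl` is installed; `wait_for_http` is a hypothetical helper name, and the attempt count and delay are arbitrary choices.

```shell
# Hypothetical helper: poll a URL until it answers with a successful status.
# Usage: wait_for_http URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_http() {
  url=$1
  attempts=${2:-60}
  delay=${3:-5}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: treat HTTP errors as failures, -s: silent, --max-time: bound each probe.
    if curl -fs --max-time 5 -o /dev/null "$url"; then
      echo "ready: $url"
      return 0
    fi
    i=$((i + 1))
    # Pause between probes, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "not ready after $attempts attempts: $url" >&2
  return 1
}
```

For example, `wait_for_http "$DEH_URL/health"` blocks until the server started above answers on its `/health` route, and returns a non-zero status if it never does.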

If you are using the Llama Stack Safety / Shield APIs, you also need to run a second TGI instance with a corresponding safety model such as meta-llama/Llama-Guard-3-1B, using a script like:

export SAFETY_INFERENCE_PORT=8282
export DEH_SAFETY_URL=http://0.0.0.0:$SAFETY_INFERENCE_PORT
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export CUDA_VISIBLE_DEVICES=1

docker run --rm -it \
  --pull always \
  --network host \
  -v $HOME/.cache/huggingface:/data \
  -e HF_TOKEN=$HF_TOKEN \
  -p $SAFETY_INFERENCE_PORT:$SAFETY_INFERENCE_PORT \
  --gpus "device=$CUDA_VISIBLE_DEVICES" \
  ghcr.io/huggingface/text-generation-inference \
  --dtype bfloat16 \
  --usage-stats off \
  --sharded false \
  --cuda-memory-fraction 0.7 \
  --model-id $SAFETY_MODEL \
  --hostname 0.0.0.0 \
  --port $SAFETY_INFERENCE_PORT

Dell distribution relies on ChromaDB for vector database usage

You can start a ChromaDB server using Docker.

# This is where the indices are persisted
mkdir -p $HOME/chromadb

docker run --rm -it \
  --network host \
  --name chromadb \
  -v $HOME/chromadb:/chroma/chroma \
  -e IS_PERSISTENT=TRUE \
  chromadb/chroma:latest \
  --port $CHROMADB_PORT \
  --host $CHROMADB_HOST
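
To confirm that the Chroma server is reachable at `CHROMA_URL`, a heartbeat probe along these lines may be useful. This is a sketch under assumptions: `chroma_is_up` is a hypothetical helper, `curl` must be installed, and the heartbeat path depends on the Chroma version (newer images are assumed to serve `/api/v2/heartbeat`, older ones `/api/v1/heartbeat`), so both are tried.

```shell
# Hypothetical helper: succeed if the Chroma server answers a heartbeat.
# Assumption: the heartbeat path is /api/v2/heartbeat on newer images and
# /api/v1/heartbeat on older ones, so both are tried in turn.
chroma_is_up() {
  base=${1:-${CHROMA_URL:-http://localhost:6601}}
  for path in /api/v2/heartbeat /api/v1/heartbeat; do
    if curl -fs --max-time 5 -o /dev/null "$base$path"; then
      echo "chroma heartbeat ok: $base$path"
      return 0
    fi
  done
  echo "no heartbeat from $base" >&2
  return 1
}
```

Calling `chroma_is_up` with no argument checks `CHROMA_URL`, falling back to the documented default.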

Running Llama Stack

Now you are ready to run Llama Stack with TGI as the inference provider. You can do this via venv or via Docker, for which a pre-built image is available.

Via Docker

This method allows you to get started quickly without having to build the distribution code.

# NOTE: To test local changes, also mount the llama-stack and llama-models
# source directories by adding the following line after the first -v option:
#   -v $HOME/git/llama-stack:/app/llama-stack-source -v $HOME/git/llama-models:/app/llama-models-source \
# and use the image localhost/distribution-dell:dev if building / testing locally.
docker run -it \
  --pull always \
  --network host \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v $HOME/.llama:/root/.llama \
  -e INFERENCE_MODEL=$INFERENCE_MODEL \
  -e DEH_URL=$DEH_URL \
  -e CHROMA_URL=$CHROMA_URL \
  llamastack/distribution-dell \
  --port $LLAMA_STACK_PORT

If you are using Llama Stack Safety / Shield APIs, use:

# You need a local checkout of llama-stack to run this, get it using
# git clone https://github.com/meta-llama/llama-stack.git
cd /path/to/llama-stack

export SAFETY_INFERENCE_PORT=8282
export DEH_SAFETY_URL=http://0.0.0.0:$SAFETY_INFERENCE_PORT
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B

docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v $HOME/.llama:/root/.llama \
  -v ./llama_stack/distributions/tgi/run-with-safety.yaml:/root/my-config.yaml \
  -e INFERENCE_MODEL=$INFERENCE_MODEL \
  -e DEH_URL=$DEH_URL \
  -e SAFETY_MODEL=$SAFETY_MODEL \
  -e DEH_SAFETY_URL=$DEH_SAFETY_URL \
  -e CHROMA_URL=$CHROMA_URL \
  llamastack/distribution-dell \
  --config /root/my-config.yaml \
  --port $LLAMA_STACK_PORT
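
The safety-enabled command passes several variables into the container, and an unset one is forwarded as an empty string. A pre-flight check such as the following sketch can catch that before launch; `require_vars` is a hypothetical helper, not a Llama Stack command.

```shell
# Hypothetical helper: report every named variable that is unset or empty.
# Returns non-zero if any is missing, so it can gate the docker run.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, running `require_vars INFERENCE_MODEL DEH_URL SAFETY_MODEL DEH_SAFETY_URL CHROMA_URL LLAMA_STACK_PORT` before the `docker run` above lists any variable that still needs to be exported.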

Via venv

Install the distribution dependencies before launching:

llama stack list-deps dell | xargs -L1 uv pip install

INFERENCE_MODEL=$INFERENCE_MODEL \
DEH_URL=$DEH_URL \
CHROMA_URL=$CHROMA_URL \
llama stack run dell \
  --port $LLAMA_STACK_PORT

If you are using Llama Stack Safety / Shield APIs, use:

INFERENCE_MODEL=$INFERENCE_MODEL \
DEH_URL=$DEH_URL \
SAFETY_MODEL=$SAFETY_MODEL \
DEH_SAFETY_URL=$DEH_SAFETY_URL \
CHROMA_URL=$CHROMA_URL \
llama stack run ./run-with-safety.yaml \
  --port $LLAMA_STACK_PORT
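
Whichever launch method you use, you may want to confirm that the server is answering on `LLAMA_STACK_PORT`. The sketch below assumes the server exposes a `GET /v1/health` route and that `curl` is installed; `stack_health_url` and `check_stack` are hypothetical helper names.

```shell
# Assumption: the Llama Stack server answers GET /v1/health on its port.
# Build the health URL from an explicit port, LLAMA_STACK_PORT, or 8321.
stack_health_url() {
  echo "http://localhost:${1:-${LLAMA_STACK_PORT:-8321}}/v1/health"
}

# Print the health response; returns non-zero if the server is unreachable.
check_stack() {
  curl -fs --max-time 5 "$(stack_health_url "$@")"
}
```

Run `check_stack` after the server has started; a non-zero exit status suggests that it is not yet listening or that the port differs from the one exported earlier.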
