Quickstart

Get started with Llama Stack in minutes!

Llama Stack is a stateful service with REST APIs to support the seamless transition of AI applications across different environments. You can build and test using a local server first and deploy to a hosted endpoint for production.

In this guide, we'll walk through how to build a retrieval-augmented generation (RAG) application locally, using Llama Stack with Ollama as the inference provider for a Llama model.

💡 Notebook Version: You can also follow this quickstart as a Jupyter notebook: quick_start.ipynb

Step 1: Install and setup

Install uv, then run inference on a Llama model with Ollama:

ollama run llama3.2:3b --keepalive 60m
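
Before moving on, it can help to confirm that Ollama is reachable and has the model available. The following is a minimal sketch, not part of the official quickstart. It assumes Ollama is listening on its default address (http://localhost:11434) and that its /api/tags endpoint returns JSON with a "models" list whose entries carry a "name" field. The helper names model_is_listed and fetch_tags are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default Ollama address


def model_is_listed(tags_payload: dict, model: str) -> bool:
    """Return True if `model` appears in a payload shaped like Ollama's /api/tags response."""
    return any(entry.get("name") == model for entry in tags_payload.get("models", []))


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Ask a running Ollama server which models it has available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        tags = fetch_tags()
    except (OSError, ValueError) as exc:  # connection problems or a non-JSON reply
        print(f"Could not query Ollama at {OLLAMA_URL}: {exc}")
    else:
        print("llama3.2:3b available:", model_is_listed(tags, "llama3.2:3b"))
```

If the check reports that the model is missing or the server is unreachable, rerun the ollama command above before starting Step 2.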

Step 2: Run the Llama Stack server

We will use uv to install dependencies and run the Llama Stack server.

# Install dependencies for the starter distribution
uv run --with llama-stack llama stack list-deps starter | xargs -L1 uv pip install

# Run the server
OLLAMA_URL=http://localhost:11434 uv run --with llama-stack llama stack run starter

Step 3: Run the demo

Now open up a new terminal and copy the following script into a file named demo_script.py.

demo_script.py
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

"""
Demo script showing RAG with both Responses API and Chat Completions API.

This example demonstrates two approaches to RAG with Llama Stack:
1. Responses API - Automatic agentic tool calling with file search
2. Chat Completions API - Manual retrieval with explicit control

Run this script after starting a Llama Stack server:
    llama stack run starter
"""

import io
import os

import requests
from openai import OpenAI

# Initialize OpenAI client pointing to Llama Stack server
client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")

print("RAG demonstration\n")

url = "https://www.paulgraham.com/greatwork.html"
print(f"Fetching document from: {url}")

vs = client.vector_stores.create()

response = requests.get(url)
response.raise_for_status()
# Wrap the raw HTML bytes in a file-like object for upload
pseudo_file = io.BytesIO(response.content)
uploaded_file = client.files.create(
    file=(url, pseudo_file, "text/html"), purpose="assistants"
)
client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded_file.id)
print(f"File uploaded and added to vector store: {uploaded_file.id}")

query = "How do you do great work?"
print(f"Query: {query}\n")

print(
    """
RAG using Responses API:
   - Automatic tool calling (model decides when to search)
   - Simpler code, less control
   - Best for: Conversational agents, automatic workflows

"""
)

print("Reply via Responses API:\n")
resp = client.responses.create(
    model=os.getenv("INFERENCE_MODEL", "ollama/llama3.2:3b"),
    input=query,
    tools=[{"type": "file_search", "vector_store_ids": [vs.id]}],
    include=["file_search_call.results"],
)

print("-" * 80)
print(resp.output[-1].content[-1].text)
print("-" * 80)

print(
    """

RAG using Chat Completions API:
   - Manual retrieval (you control the search)
   - More code, more control
   - Best for: Custom RAG patterns, batch processing, specialized workflows
"""
)

print("Searching vector store...")
search_results = client.vector_stores.search(
    vector_store_id=vs.id, query=query, max_num_results=3, rewrite_query=False
)

# Extract context from search results
context_chunks = []
for result in search_results.data:
    # result.content is a list of Content objects, extract the text from each
    for content_item in result.content:
        context_chunks.append(content_item.text)

context = "\n\n".join(context_chunks)
print(f"Found {len(context_chunks)} relevant chunks\n")

print("Reply via Chat Completions API:\n")
completion = client.chat.completions.create(
    model=os.getenv("INFERENCE_MODEL", "ollama/llama3.2:3b"),
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Use the provided context to answer the user's question.",
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nPlease provide a comprehensive answer based on the context above.",
        },
    ],
    temperature=0.7,
)

print("-" * 80)
print(completion.choices[0].message.content)
print("-" * 80)

We will use uv to run the script. The script imports the openai and requests packages, so both are passed to --with:

uv run --with openai,requests demo_script.py

You should see output like the following.

>print(resp.output[1].content[0].text)
To do great work, consider the following principles:

1. **Follow Your Interests**: Engage in work that genuinely excites you. If you find an area intriguing, pursue it without being overly concerned about external pressures or norms. You should create things that you would want for yourself, as this often aligns with what others in your circle might want too.

2. **Work Hard on Ambitious Projects**: Ambition is vital, but it should be tempered by genuine interest. Instead of detailed planning for the future, focus on exciting projects that keep your options open. This approach, known as "staying upwind," allows for adaptability and can lead to unforeseen achievements.

3. **Choose Quality Colleagues**: Collaborating with talented colleagues can significantly affect your own work. Seek out individuals who offer surprising insights and whom you admire. The presence of good colleagues can elevate the quality of your work and inspire you.

4. **Maintain High Morale**: Your attitude towards work and life affects your performance. Cultivating optimism and viewing yourself as lucky rather than victimized can boost your productivity. It’s essential to care for your physical health as well since it directly impacts your mental faculties and morale.

5. **Be Consistent**: Great work often comes from cumulative effort. Daily progress, even in small amounts, can result in substantial achievements over time. Emphasize consistency and make the work engaging, as this reduces the perceived burden of hard labor.

6. **Embrace Curiosity**: Curiosity is a driving force that can guide you in selecting fields of interest, pushing you to explore uncharted territories. Allow it to shape your work and continually seek knowledge and insights.

By focusing on these aspects, you can create an environment conducive to great work and personal fulfillment.

Congratulations! You've successfully built your first RAG application using Llama Stack! 🎉🥳
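
The demo script reads the reply with resp.output[-1].content[-1].text, which relies on the last output item being a message whose last content part holds the text. Below is a minimal sketch of a more defensive way to collect the reply. It assumes each output item exposes a type attribute, that message items carry a content list, and that text parts expose a text attribute. The helper name collect_output_text and the stand-in objects are hypothetical and exist only for illustration.

```python
from types import SimpleNamespace


def collect_output_text(output_items) -> str:
    """Join the text parts of every message item in a Responses-style output list."""
    parts = []
    for item in output_items:
        if getattr(item, "type", None) != "message":
            continue  # skip non-message items such as a file search call
        for part in getattr(item, "content", None) or []:
            text = getattr(part, "text", None)
            if text:
                parts.append(text)
    return "\n".join(parts)


# Stand-in objects shaped like the assumed output list; this is not real API output
fake_output = [
    SimpleNamespace(type="file_search_call", results=[]),
    SimpleNamespace(
        type="message",
        content=[SimpleNamespace(type="output_text", text="Follow your interests.")],
    ),
]
print(collect_output_text(fake_output))
```

In the demo script, a call such as collect_output_text(resp.output) could stand in for the indexed lookup, under the same assumptions about the response shape.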

HuggingFace access

If you get a 401 Client Error from HuggingFace for the all-MiniLM-L6-v2 model, try setting HF_TOKEN to a valid HuggingFace token in your environment.
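
As a quick sanity check, the sketch below reports whether HF_TOKEN is visible to the current process. It is only an illustration: the helper name hf_token_hint is hypothetical, and it assumes you run it from the same shell that will launch the command hitting the 401 error. It does not validate the token against HuggingFace.

```python
import os


def hf_token_hint(env=None) -> str:
    """Describe whether HF_TOKEN is present in the given environment mapping."""
    env = os.environ if env is None else env
    if env.get("HF_TOKEN", "").strip():
        return "HF_TOKEN is set"
    return "HF_TOKEN is not set in this environment"


print(hf_token_hint())
```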

Next Steps

Now you're ready to dive deeper into Llama Stack!

Explore the Detailed Tutorial.
Try the Getting Started Notebook.
Browse more Notebooks on GitHub.
Learn about Llama Stack Concepts.
Discover how to Build Llama Stacks.
Refer to our References for details on the Llama CLI and Python SDK.
Check out the llama-stack-apps repository for example applications and tutorials.
