
Retrieval Augmented Generation (RAG)

RAG enables your applications to reference and recall information from previous interactions or external documents.

Architecture Overview

Llama Stack organizes the APIs that enable RAG into three layers:

  1. Lower-Level APIs: Deal with raw storage and retrieval. These include Vector IO, KeyValue IO (coming soon), and Relational IO (also coming soon).
  2. RAG Tool: A first-class tool in the Tools API that lets you ingest documents (from URLs, files, etc.) with various chunking strategies and query them smartly.
  3. Agents API: The top-level Agents API that allows you to create agents that can use these tools to answer questions, perform tasks, and more.

RAG System Architecture

The RAG system uses lower-level storage for different types of data:

  • Vector IO: For semantic search and retrieval
  • Key-Value and Relational IO: For structured data storage

Future Storage Types

We may add more storage types like Graph IO in the future.

Setting up Vector Databases

For this guide, we will use Ollama as the inference provider. Ollama is an LLM runtime that allows you to run Llama models locally.

Here's how to set up a vector database for RAG:

# Create HTTP client
import os
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")

# Register a vector database
vector_db_id = "my_documents"
response = client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
    provider_id="faiss",
)
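
To sanity-check the registration, you can list the vector databases known to the stack. This is a minimal sketch; it assumes the client exposes a vector_dbs.list() call and that the returned entries carry identifier and embedding fields, so adjust the attribute names to your llama_stack_client version if needed.

# List registered vector databases
# (attribute names are an assumption; check your llama_stack_client version)
for db in client.vector_dbs.list():
    print(db.identifier, db.embedding_model, db.embedding_dimension)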

Document Ingestion

You can ingest documents into the vector database using two methods: directly inserting pre-chunked documents or using the RAG Tool.

Direct Document Insertion

# You can insert a pre-chunked document directly into the vector db
chunks = [
    {
        "content": "Your document text here",
        "mime_type": "text/plain",
        "metadata": {
            "document_id": "doc1",
            "author": "Jane Doe",
        },
    },
]
client.vector_io.insert(vector_db_id=vector_db_id, chunks=chunks)

Document Retrieval

You can query the vector database to retrieve documents based on their embeddings.

# You can then query for these chunks
chunks_response = client.vector_io.query(
    vector_db_id=vector_db_id,
    query="What do you know about...",
)
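
Each retrieved chunk is paired with a similarity score. Here is a minimal sketch for inspecting the results, assuming the response exposes chunks and scores fields as in the Vector IO API:

# Walk the retrieved chunks alongside their scores
# (field names assume the Vector IO QueryChunksResponse shape)
for chunk, score in zip(chunks_response.chunks, chunks_response.scores):
    print(f"score={score:.3f}, document_id={chunk.metadata.get('document_id')}")
    print(chunk.content)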

Using the RAG Tool

Deprecation Notice

The RAG Tool is being deprecated in favor of directly using the OpenAI-compatible Search API. We recommend migrating to the OpenAI APIs for better compatibility and future support.

A better way to ingest documents is to use the RAG Tool. This tool allows you to ingest documents from URLs, files, etc. and automatically chunks them into smaller pieces. More examples for how to format a RAGDocument can be found in the appendix.

OpenAI API Integration & Migration

The RAG tool has been updated to use OpenAI-compatible APIs. This provides several benefits:

  • Files API Integration: Documents are now uploaded using OpenAI's file upload endpoints
  • Vector Stores API: Vector storage operations use OpenAI's vector store format with configurable chunking strategies
  • Error Resilience: When processing multiple documents, individual failures are logged but don't crash the operation. Failed documents are skipped while successful ones continue processing.

Migration Path

We recommend migrating to the OpenAI-compatible Search API for:

  1. Better OpenAI Ecosystem Integration: Direct compatibility with OpenAI tools and workflows including the Responses API
  2. Future-Proof: Continued support and feature development
  3. Full OpenAI Compatibility: Vector Stores, Files, and Search APIs are fully compatible with OpenAI's Responses API

The OpenAI APIs are used under the hood, so you can continue to use your existing RAG Tool code with minimal changes. However, we recommend updating your code to use the new OpenAI-compatible APIs for better long-term support. If any documents fail to process, they will be logged in the response but will not cause the entire operation to fail.
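
As a rough sketch of the migration target, the same ingest-and-search flow can be expressed with the OpenAI-compatible Files and Vector Stores APIs. The example below is illustrative: it assumes your distribution serves the OpenAI-compatible routes under /v1/openai/v1 and that the standalone openai Python package is installed; check your Llama Stack version for the exact base URL.

import os
from openai import OpenAI

# Point the standard OpenAI client at the Llama Stack server
# (the /v1/openai/v1 path is an assumption; confirm it for your distribution)
openai_client = OpenAI(
    base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}/v1/openai/v1",
    api_key="none",  # placeholder; supply a real key if your deployment requires one
)

# Upload a file and attach it to a new vector store
vector_store = openai_client.vector_stores.create(name="my_documents")
uploaded_file = openai_client.files.create(
    file=open("memory_optimizations.rst", "rb"), purpose="assistants"
)
openai_client.vector_stores.files.create(
    vector_store_id=vector_store.id, file_id=uploaded_file.id
)

# Search the vector store with the OpenAI-compatible Search API
search_results = openai_client.vector_stores.search(
    vector_store_id=vector_store.id,
    query="What do you know about...",
)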

RAG Tool Example

from llama_stack_client import RAGDocument

urls = ["memory_optimizations.rst", "chat.rst", "llama3.rst"]
documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
        mime_type="text/plain",
        metadata={},
    )
    for i, url in enumerate(urls)
]

client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
)

# Query documents
results = client.tool_runtime.rag_tool.query(
    vector_db_ids=[vector_db_id],
    content="What do you know about...",
)
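
The query result bundles the retrieved chunks as interleaved content along with retrieval metadata. A minimal sketch for inspecting it, assuming the result exposes content and metadata fields:

# Print the assembled context and any retrieval metadata
# (field names assume the RAGQueryResult shape)
print(results.content)
print(results.metadata)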

Custom Context Configuration

You can configure how the RAG tool adds metadata to the context if you find it useful for your application:

# Query documents with custom template
results = client.tool_runtime.rag_tool.query(
    vector_db_ids=[vector_db_id],
    content="What do you know about...",
    query_config={
        "chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
    },
)

Building RAG-Enhanced Agents

One of the most powerful patterns is combining agents with RAG capabilities. Here's a complete example:

from llama_stack_client import Agent

# Create agent with memory
agent = Agent(
    client,
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant",
    tools=[
        {
            "name": "builtin::rag/knowledge_search",
            "args": {
                "vector_db_ids": [vector_db_id],
                # Defaults
                "query_config": {
                    "chunk_size_in_tokens": 512,
                    "chunk_overlap_in_tokens": 0,
                    "chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
                },
            },
        }
    ],
)
session_id = agent.create_session("rag_session")

# Ask questions about documents in the vector db, and the agent will query the db to answer the question.
response = agent.create_turn(
    messages=[{"role": "user", "content": "How to optimize memory in PyTorch?"}],
    session_id=session_id,
)

Agent Instructions

The instructions field in the AgentConfig can be used to guide the agent's behavior. It is important to experiment with different instructions to see what works best for your use case.

Document-Aware Conversations

You can also pass documents along with the user's message and ask questions about them:

# Initial document ingestion
response = agent.create_turn(
    messages=[
        {"role": "user", "content": "I am providing some documents for reference."}
    ],
    documents=[
        {
            "content": "https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/memory_optimizations.rst",
            "mime_type": "text/plain",
        }
    ],
    session_id=session_id,
)

# Query with RAG
response = agent.create_turn(
    messages=[{"role": "user", "content": "What are the key topics in the documents?"}],
    session_id=session_id,
)

Viewing Agent Responses

You can print the response with the following:

from llama_stack_client import AgentEventLogger

for log in AgentEventLogger().log(response):
    log.print()

Vector Database Management

Unregistering Vector DBs

If you need to clean up and unregister vector databases, you can do so as follows:

# Unregister a specified vector database
vector_db_id = "my_vector_db_id"
print(f"Unregistering vector database: {vector_db_id}")
client.vector_dbs.unregister(vector_db_id=vector_db_id)

Best Practices

🎯 Document Chunking

  • Use appropriate chunk sizes (512 tokens is often a good starting point)
  • Consider overlap between chunks for better context preservation (see the sketch after this list)
  • Experiment with different chunking strategies for your content type
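
For illustration, here is a small, self-contained sketch of fixed-size chunking with overlap. It splits on whitespace rather than using a real tokenizer, so the sizes are approximate; in practice you would chunk with the same tokenizer as your embedding model.

def chunk_with_overlap(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into word-based chunks with a fixed overlap (illustrative only)."""
    words = text.split()
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks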

🔍 Embedding Strategy

  • Choose embedding models that match your domain
  • Consider the trade-off between embedding dimension and performance
  • Test different embedding models for your specific use case

📊 Query Optimization

  • Use specific, well-formed queries for better retrieval
  • Experiment with different search strategies
  • Consider hybrid approaches (keyword + semantic search)

🛡️ Error Handling

  • Implement proper error handling for failed document processing (see the sketch after this list)
  • Monitor ingestion success rates
  • Have fallback strategies for retrieval failures
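
As a sketch of per-document error handling around the rag_tool.insert call shown earlier (the logging setup and counters are illustrative):

import logging

logger = logging.getLogger("rag_ingestion")

succeeded, failed = 0, 0
for document in documents:
    try:
        # Insert one document at a time so a single failure doesn't abort the batch
        client.tool_runtime.rag_tool.insert(
            documents=[document],
            vector_db_id=vector_db_id,
            chunk_size_in_tokens=512,
        )
        succeeded += 1
    except Exception as exc:
        failed += 1
        logger.warning("Failed to ingest %s: %s", document.document_id, exc)

logger.info("Ingestion finished: %d succeeded, %d failed", succeeded, failed)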

Appendix

More RAGDocument Examples

Here are various ways to create RAGDocument objects for different content types:

from llama_stack_client import RAGDocument
import base64
import requests

# File URI
RAGDocument(document_id="num-0", content={"uri": "file://path/to/file"})

# Plain text
RAGDocument(document_id="num-1", content="plain text")

# Explicit text input
RAGDocument(
    document_id="num-2",
    content={
        "type": "text",
        "text": "plain text input",
    },  # for inputs that should be treated as text explicitly
)

# Image from URL
RAGDocument(
    document_id="num-3",
    content={
        "type": "image",
        "image": {"url": {"uri": "https://mywebsite.com/image.jpg"}},
    },
)

# Base64 encoded image
B64_ENCODED_IMAGE = base64.b64encode(
    requests.get(
        "https://raw.githubusercontent.com/meta-llama/llama-stack/refs/heads/main/docs/_static/llama-stack.png"
    ).content
)
RAGDocument(
    document_id="num-4",
    content={"type": "image", "image": {"data": B64_ENCODED_IMAGE}},
)

For more strongly typed interactions, use the typed dicts found here.