
RAG Benchmarks

Llama Stack implements an OpenAI-compatible API surface for RAG — Files, Vector Stores, and the Responses API with the file_search tool. This page presents benchmark results comparing Llama Stack's RAG quality against OpenAI SaaS, and describes the evaluation methodology.

Summary of Results

We evaluated Llama Stack against OpenAI across four benchmark suites covering retrieval accuracy, end-to-end answer quality, and multi-turn conversational RAG. Llama Stack was tested in two search modes: vector (pure semantic search) and hybrid (semantic + keyword search with reranking).

Retrieval Quality (BEIR)

| Dataset  | Corpus Size | Queries | OpenAI | Llama Stack (vector) | Llama Stack (hybrid) |
|----------|-------------|---------|--------|----------------------|----------------------|
| nfcorpus | 3,633       | 323     | 0.3156 | 0.3106               | 0.3350               |
| scifact  | 5,183       | 300     | 0.7165 | 0.6943               | 0.7137               |
| arguana  | 8,674       | 1,406   | 0.2960 | 0.3765               | 0.3835               |
| fiqa     | 57,638      | 648     | 0.2862 | 0.2399               | 0.2170               |

Metric: nDCG@10 (normalized Discounted Cumulative Gain at rank 10). Higher is better.
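
For intuition, nDCG@10 can be sketched in a few lines of Python. This is an illustrative computation only; the suite itself scores with pytrec_eval, as described under Methodology.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: graded relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the system's ranking divided by DCG of an ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Relevance grades of the returned documents, in rank order.
print(ndcg_at_k([3, 2, 1, 0]))  # 1.0: a perfect ranking
print(ndcg_at_k([0, 3, 2, 1]))  # < 1.0: the best document was ranked too low
```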

End-to-End RAG Quality

| Benchmark    | Type                       | OpenAI (F1) | Llama Stack vector (F1) | Llama Stack hybrid (F1) |
|--------------|----------------------------|-------------|-------------------------|-------------------------|
| MultiHOP RAG | Multi-hop reasoning        | 0.0114      | 0.0141                  | 0.0141                  |
| Doc2Dial     | Document-grounded dialogue | 0.1337      | 0.0962                  | 0.0966                  |

Metric: Token-level F1 (SQuAD-style). Higher is better. All end-to-end benchmarks used GPT-4.1 as the generation model.
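
Token-level F1 treats the prediction and the ground truth as bags of tokens. A minimal sketch (the full SQuAD scorer additionally strips punctuation and articles; this version only lowercases and splits on whitespace):

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """SQuAD-style token F1: harmonic mean of token precision and recall."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# One shared token out of six predicted: precision 1/6, recall 1, F1 = 2/7.
print(token_f1("the capital of France is Paris", "Paris"))  # 0.2857...
```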

Key Takeaways

  • Hybrid search matches or beats OpenAI on 3 of 4 BEIR datasets (nfcorpus, scifact, arguana), demonstrating that Llama Stack's combination of semantic search, keyword search, and reranking is competitive with OpenAI's proprietary retrieval.
  • Llama Stack outperforms OpenAI on arguana by roughly 30% (0.3835 vs 0.2960), a dataset that benefits from keyword matching on argumentative text.
  • fiqa is the one clear gap — OpenAI leads by 19% on this financial QA dataset (0.2862 vs 0.2399 in Llama Stack's stronger vector mode), likely due to differences in embedding models and chunking strategies for longer financial documents.
  • End-to-end RAG scores are low across both backends, reflecting the difficulty of these benchmarks (multi-hop reasoning, document-grounded dialogue) rather than a RAG-specific weakness. Both backends use the same LLM (GPT-4.1) for generation.

Methodology

API Surface Tested

The benchmark suite exercises four layers of the OpenAI-compatible API:

  1. Files API (POST /v1/files) — Upload documents as individual files
  2. Vector Stores API (POST /v1/vector_stores, POST /v1/vector_stores/{id}/files) — Create vector stores and attach files for automatic chunking and embedding
  3. Vector Stores Search API (POST /v1/vector_stores/{id}/search) — Direct retrieval evaluation (BEIR)
  4. Responses API (POST /v1/responses with file_search tool) — End-to-end RAG with automatic retrieval and generation (MultiHOP, Doc2Dial)

The same benchmark code runs against both OpenAI and Llama Stack — the only difference is the --base-url flag.
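
That symmetry can be sketched as data. The helper below is hypothetical (it is not part of the suite); the endpoint paths come from the list above, and vs_123 stands in for a server-assigned vector store ID:

```python
def rag_call_sequence(base_url: str, vector_store_id: str = "vs_123"):
    """Ordered API calls a benchmark run issues; only base_url changes per backend."""
    return [
        ("POST", f"{base_url}/files"),                                  # upload each document
        ("POST", f"{base_url}/vector_stores"),                          # create the store
        ("POST", f"{base_url}/vector_stores/{vector_store_id}/files"),  # attach files (chunk + embed)
        ("POST", f"{base_url}/vector_stores/{vector_store_id}/search"), # retrieval-only (BEIR)
        ("POST", f"{base_url}/responses"),                              # end-to-end with file_search
    ]

openai_calls = rag_call_sequence("https://api.openai.com/v1")
stack_calls = rag_call_sequence("http://localhost:8321/v1")
# Same endpoints either way; only the host differs.
assert [path.split("/v1")[1] for _, path in openai_calls] == \
       [path.split("/v1")[1] for _, path in stack_calls]
```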

Llama Stack Configuration

| Component        | Configuration                                          |
|------------------|--------------------------------------------------------|
| Embedding model  | nomic-ai/nomic-embed-text-v1.5 (sentence-transformers) |
| Reranker model   | Qwen/Qwen3-Reranker-0.6B (transformers)                |
| Vector database  | Milvus (standalone, remote)                            |
| Chunk size       | 512 tokens                                             |
| Chunk overlap    | 128 tokens                                             |
| Hybrid search    | RRF fusion (impact factor 60.0) with reranker          |
| Generation model | GPT-4.1 (via remote::openai provider)                  |
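
The "impact factor 60.0" likely corresponds to the k constant in the standard Reciprocal Rank Fusion formula, where each result list contributes 1/(k + rank) per document. A minimal sketch of the fusion step (illustrative only, not Llama Stack's implementation):

```python
def rrf_fuse(rankings, k=60.0):
    """Reciprocal Rank Fusion: sum 1/(k + rank) contributions across result lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d2"]   # from vector search
keyword = ["d1", "d4", "d3"]    # from keyword search
print(rrf_fuse([semantic, keyword]))  # ['d1', 'd3', 'd4', 'd2']
```

Documents that rank well in both lists (d1, d3) float to the top; the fused list is then passed to the reranker model.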

Benchmarks

BEIR (Retrieval-Only)

BEIR is a standard information retrieval benchmark. We evaluate on four datasets spanning biomedical, scientific, argumentative, and financial domains:

  • nfcorpus — Biomedical information retrieval (3,633 documents, 323 queries)
  • scifact — Scientific fact verification (5,183 documents, 300 queries)
  • arguana — Counterargument retrieval (8,674 documents, 1,406 queries)
  • fiqa — Financial opinion QA (57,638 documents, 648 queries)

BEIR benchmarks are retrieval-only — no LLM is involved. Documents are uploaded via the Files API, indexed in a Vector Store, and queries are evaluated using the Vector Stores Search API. Results are scored with pytrec_eval against BEIR's ground-truth relevance judgments.
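
pytrec_eval consumes plain nested dicts: qrels map each query to graded document judgments, and the run maps each query to retrieval scores. A minimal sketch with hypothetical IDs and scores:

```python
# Ground-truth relevance judgments (qrels): query -> doc -> graded relevance.
qrels = {"q1": {"d1": 2, "d2": 0, "d3": 1}}
# System output (run): query -> doc -> retrieval score from the search API.
run = {"q1": {"d1": 9.2, "d3": 7.5, "d4": 1.1}}

try:
    import pytrec_eval  # third-party; skip gracefully if not installed
    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10"})
    print(evaluator.evaluate(run)["q1"]["ndcg_cut_10"])
except ImportError:
    pass
```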

MultiHOP RAG (End-to-End)

MultiHOP RAG tests multi-hop reasoning over news articles. The system must retrieve evidence from multiple documents and synthesize an answer. Queries are sent to the Responses API with the file_search tool, and answers are evaluated with Exact Match, token-level F1, and ROUGE-L.

Doc2Dial (Document-Grounded Dialogue)

Doc2Dial evaluates document-grounded dialogue across government and social service domains. Each conversation is a multi-turn exchange between a user and an agent, grounded in a specific document. Conversations are threaded using previous_response_id to maintain context across turns, matching how production chat applications work.
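
A sketch of that threading, assuming the standard Responses API request shape (model, input, tools, previous_response_id); the response ID and vector store ID are placeholders:

```python
def turn_payload(user_message, vector_store_id, previous_response_id=None):
    """Build one Responses API request; chaining previous_response_id
    carries conversation context across turns."""
    payload = {
        "model": "gpt-4.1",  # the benchmark's generation model
        "input": user_message,
        "tools": [{"type": "file_search", "vector_store_ids": [vector_store_id]}],
    }
    if previous_response_id is not None:
        payload["previous_response_id"] = previous_response_id
    return payload

first = turn_payload("How do I renew my license?", "vs_123")
# The server's response carries an id (e.g. "resp_abc"); thread it into turn 2.
second = turn_payload("What documents do I need?", "vs_123",
                      previous_response_id="resp_abc")
assert "previous_response_id" not in first
assert second["previous_response_id"] == "resp_abc"
```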

Metrics

| Metric      | Used In            | Description                                                                 |
|-------------|--------------------|-----------------------------------------------------------------------------|
| nDCG@10     | BEIR               | Normalized Discounted Cumulative Gain at rank 10 — measures ranking quality |
| Recall@10   | BEIR               | Fraction of relevant documents retrieved in top 10                          |
| MAP@10      | BEIR               | Mean Average Precision at rank 10                                           |
| Exact Match | MultiHOP, Doc2Dial | Whether the prediction exactly matches the ground truth                     |
| F1          | MultiHOP, Doc2Dial | Token-level F1 (SQuAD-style precision/recall on answer tokens)              |
| ROUGE-L     | MultiHOP, Doc2Dial | Longest common subsequence overlap between prediction and ground truth      |

Running the Benchmarks

The benchmark suite lives in benchmarking/rag/. See the README for full setup instructions.

Quick Start

```bash
cd benchmarking/rag
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# Copy and configure environment
cp .env.example .env
# Edit .env with your OPENAI_API_KEY
```

Run Against OpenAI

```bash
python run_benchmark.py --benchmark beir --base-url https://api.openai.com/v1
python run_benchmark.py --benchmark multihop --base-url https://api.openai.com/v1
python run_benchmark.py --benchmark doc2dial --base-url https://api.openai.com/v1
```

Run Against Llama Stack

```bash
# Start Llama Stack with Milvus backend
bash start_stack.sh

# Vector search mode
python run_benchmark.py --benchmark beir --base-url http://localhost:8321/v1

# Hybrid search mode (recommended)
python run_benchmark.py --benchmark beir --base-url http://localhost:8321/v1 --search-mode hybrid
```

Compare Results

After running benchmarks against multiple backends:

```bash
python compare_results.py              # Table output
python compare_results.py --format csv # CSV for spreadsheets
```

Extending the Benchmarks

The suite is designed to be extended with new benchmarks. Each benchmark implements the BenchmarkRunner interface:

```python
from benchmarks.base import BenchmarkRunner

class MyBenchmark(BenchmarkRunner):
    name = "my_benchmark"

    def download(self) -> None:
        """Download or load the dataset."""
        ...

    def load_data(self) -> None:
        """Parse the dataset into corpus, queries, and ground truths."""
        ...

    def ingest(self) -> None:
        """Upload corpus to Files API and create a Vector Store."""
        ...

    def evaluate(self) -> dict:
        """Run queries and compute metrics."""
        ...
```

Register the new benchmark in run_benchmark.py and it will be available via --benchmark my_benchmark.
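
The registration mechanism itself isn't shown above; one common pattern is a name-to-class registry, sketched below with a stand-in BenchmarkRunner so the example is self-contained (the actual wiring in run_benchmark.py may differ):

```python
# Hypothetical sketch of a registry; BenchmarkRunner here stands in for
# benchmarks.base.BenchmarkRunner.
class BenchmarkRunner:
    name = "base"

class MyBenchmark(BenchmarkRunner):
    name = "my_benchmark"

# Name -> class mapping consulted when resolving --benchmark <name>.
REGISTRY = {cls.name: cls for cls in (MyBenchmark,)}

def get_runner(name: str) -> BenchmarkRunner:
    return REGISTRY[name]()

assert isinstance(get_runner("my_benchmark"), MyBenchmark)
```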