# RAG Benchmarks
Llama Stack implements an OpenAI-compatible API surface for RAG: Files, Vector Stores, and the Responses API with the `file_search` tool. This page presents benchmark results comparing Llama Stack's RAG quality against OpenAI SaaS, and describes the evaluation methodology.
## Summary of Results
We evaluated Llama Stack against OpenAI across four benchmark suites covering retrieval accuracy, end-to-end answer quality, and multi-turn conversational RAG. Llama Stack was tested in two search modes: vector (pure semantic search) and hybrid (semantic + keyword search with reranking).
### Retrieval Quality (BEIR)
| Dataset | Corpus Size | Queries | OpenAI | Llama Stack (vector) | Llama Stack (hybrid) |
|---|---|---|---|---|---|
| nfcorpus | 3,633 | 323 | 0.3156 | 0.3106 | 0.3350 |
| scifact | 5,183 | 300 | 0.7165 | 0.6943 | 0.7137 |
| arguana | 8,674 | 1,406 | 0.2960 | 0.3765 | 0.3835 |
| fiqa | 57,638 | 648 | 0.2862 | 0.2399 | 0.2170 |
Metric: nDCG@10 (normalized Discounted Cumulative Gain at rank 10). Higher is better.
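For reference, nDCG@10 can be computed from a ranked result list and graded relevance judgments. The sketch below uses the linear-gain formulation (gain = relevance grade, logarithmic rank discount), which is one common convention; `ranked_ids` and `relevance` are illustrative names, not identifiers from the benchmark suite.

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k with linear gain: DCG of the ranking over DCG of the ideal ranking."""
    def dcg(gains):
        # Ranks are 1-indexed, so the discount for rank r is log2(r + 1)
        return sum(g / math.log2(r + 1) for r, g in enumerate(gains, start=1))

    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if dcg(ideal) > 0 else 0.0
```

A perfect ranking scores 1.0; placing relevant documents lower in the top-10 discounts their contribution logarithmically.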
### End-to-End RAG Quality
| Benchmark | Type | OpenAI (F1) | Llama Stack vector (F1) | Llama Stack hybrid (F1) |
|---|---|---|---|---|
| MultiHOP RAG | Multi-hop reasoning | 0.0114 | 0.0141 | 0.0141 |
| Doc2Dial | Document-grounded dialogue | 0.1337 | 0.0962 | 0.0966 |
Metric: Token-level F1 (SQuAD-style). Higher is better. All end-to-end benchmarks used GPT-4.1 as the generation model.
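As a reference for the metric, a minimal SQuAD-style token F1 looks like the following. The official SQuAD script also strips punctuation and articles before tokenizing; this sketch abbreviates normalization to lowercasing and whitespace splitting.

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count, count) times
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```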
### Key Takeaways
- Hybrid search matches or beats OpenAI on 3 of 4 BEIR datasets (nfcorpus, scifact, arguana), demonstrating that Llama Stack's combination of semantic search, keyword search, and reranking is competitive with OpenAI's proprietary retrieval.
- Llama Stack outperforms OpenAI on arguana by roughly 30% (0.3835 vs 0.2960), a dataset that benefits from keyword matching on argumentative text.
- fiqa is the one gap: OpenAI leads by 19% on this financial QA dataset (0.2862 vs 0.2399 for vector search), likely due to differences in embedding models and chunking strategies for longer financial documents.
- End-to-end RAG scores are low across both backends, reflecting the difficulty of these benchmarks (multi-hop reasoning, document-grounded dialogue) rather than a RAG-specific weakness. Both backends use the same LLM (GPT-4.1) for generation.
## Methodology

### API Surface Tested

The benchmark suite exercises four endpoints of the OpenAI-compatible API:

- **Files API** (`POST /v1/files`) — Upload documents as individual files
- **Vector Stores API** (`POST /v1/vector_stores`, `POST /v1/vector_stores/{id}/files`) — Create vector stores and attach files for automatic chunking and embedding
- **Vector Stores Search API** (`POST /v1/vector_stores/{id}/search`) — Direct retrieval evaluation (BEIR)
- **Responses API** (`POST /v1/responses` with the `file_search` tool) — End-to-end RAG with automatic retrieval and generation (MultiHOP, Doc2Dial)
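A direct retrieval call in the BEIR runs boils down to one search request per query. The helper below (a hypothetical name, not from the suite) builds the path and JSON body in the shape the OpenAI-compatible search endpoint expects (`query` plus `max_num_results`); the real suite may set additional options such as ranking parameters.

```python
import json

def build_search_request(vector_store_id: str, query: str, k: int = 10):
    """Path and JSON body for POST /v1/vector_stores/{id}/search."""
    path = f"/v1/vector_stores/{vector_store_id}/search"
    body = json.dumps({"query": query, "max_num_results": k}).encode()
    return path, body
```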
The same benchmark code runs against both OpenAI and Llama Stack — the only difference is the `--base-url` flag.
### Llama Stack Configuration
| Component | Configuration |
|---|---|
| Embedding model | `nomic-ai/nomic-embed-text-v1.5` (sentence-transformers) |
| Reranker model | `Qwen/Qwen3-Reranker-0.6B` (transformers) |
| Vector database | Milvus (standalone, remote) |
| Chunk size | 512 tokens |
| Chunk overlap | 128 tokens |
| Hybrid search | RRF fusion (impact factor 60.0) with reranker |
| Generation model | GPT-4.1 (via `remote::openai` provider) |
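The RRF step in hybrid search can be illustrated in a few lines. This is a generic sketch of Reciprocal Rank Fusion, not Llama Stack's actual implementation: each document scores 1 / (k + rank) in every ranked list it appears in, with k set to the impact factor of 60.0 from the table above; the reranker then reorders the fused candidates.

```python
def rrf_fuse(vector_ranking, keyword_ranking, k=60.0):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores = {}
    for ranking in (vector_ranking, keyword_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked highly by both signals (like "b" in the test below, ranked 2nd and 1st) beat documents that top only one list.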
## Benchmarks
### BEIR (Retrieval-Only)
BEIR is a standard information retrieval benchmark. We evaluate on four datasets spanning biomedical, scientific, argumentative, and financial domains:
- **nfcorpus** — Biomedical information retrieval (3,633 documents, 323 queries)
- **scifact** — Scientific fact verification (5,183 documents, 300 queries)
- **arguana** — Counterargument retrieval (8,674 documents, 1,406 queries)
- **fiqa** — Financial opinion QA (57,638 documents, 648 queries)
BEIR benchmarks are retrieval-only — no LLM is involved. Documents are uploaded via the Files API, indexed in a Vector Store, and queries are evaluated using the Vector Stores Search API. Results are scored with `pytrec_eval` against BEIR's ground-truth relevance judgments.
### MultiHOP RAG (End-to-End)
MultiHOP RAG tests multi-hop reasoning over news articles. The system must retrieve evidence from multiple documents and synthesize an answer. Queries are sent to the Responses API with the `file_search` tool, and answers are evaluated with Exact Match, token-level F1, and ROUGE-L.
### Doc2Dial (Document-Grounded Dialogue)
Doc2Dial evaluates document-grounded dialogue across government and social service domains. Each conversation is a multi-turn exchange between a user and an agent, grounded in a specific document. Conversations are threaded using `previous_response_id` to maintain context across turns, matching how production chat applications work.
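A multi-turn exchange then reduces to chaining response IDs. The sketch below builds the request body for each turn in the shape the Responses API accepts (the `threaded_turn` helper and the conversation content are illustrative, not from the suite): the first turn omits `previous_response_id`, and every later turn passes the ID returned by the previous call.

```python
def threaded_turn(user_message, vector_store_id, previous_response_id=None):
    """Request body for one Responses API turn with the file_search tool."""
    body = {
        "model": "gpt-4.1",
        "input": user_message,
        "tools": [{"type": "file_search", "vector_store_ids": [vector_store_id]}],
    }
    if previous_response_id is not None:
        # Threads this turn onto the prior response, preserving conversation state
        body["previous_response_id"] = previous_response_id
    return body
```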
## Metrics
| Metric | Used In | Description |
|---|---|---|
| nDCG@10 | BEIR | Normalized Discounted Cumulative Gain at rank 10 — measures ranking quality |
| Recall@10 | BEIR | Fraction of relevant documents retrieved in top 10 |
| MAP@10 | BEIR | Mean Average Precision at rank 10 |
| Exact Match | MultiHOP, Doc2Dial | Whether the prediction exactly matches the ground truth |
| F1 | MultiHOP, Doc2Dial | Token-level F1 (SQuAD-style precision/recall on answer tokens) |
| ROUGE-L | MultiHOP, Doc2Dial | Longest common subsequence overlap between prediction and ground truth |
## Running the Benchmarks

The benchmark suite lives in `benchmarking/rag/`. See the README for full setup instructions.
### Quick Start

```shell
cd benchmarking/rag
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# Copy and configure the environment
cp .env.example .env
# Edit .env with your OPENAI_API_KEY
```
### Run Against OpenAI

```shell
python run_benchmark.py --benchmark beir --base-url https://api.openai.com/v1
python run_benchmark.py --benchmark multihop --base-url https://api.openai.com/v1
python run_benchmark.py --benchmark doc2dial --base-url https://api.openai.com/v1
```
### Run Against Llama Stack

```shell
# Start Llama Stack with the Milvus backend
bash start_stack.sh

# Vector search mode
python run_benchmark.py --benchmark beir --base-url http://localhost:8321/v1

# Hybrid search mode (recommended)
python run_benchmark.py --benchmark beir --base-url http://localhost:8321/v1 --search-mode hybrid
```
### Compare Results

After running benchmarks against multiple backends:

```shell
python compare_results.py                 # Table output
python compare_results.py --format csv    # CSV for spreadsheets
```
## Extending the Benchmarks

The suite is designed to be extended with new benchmarks. Each benchmark implements the `BenchmarkRunner` interface:

```python
from benchmarks.base import BenchmarkRunner


class MyBenchmark(BenchmarkRunner):
    name = "my_benchmark"

    def download(self) -> None:
        """Download or load the dataset."""
        ...

    def load_data(self) -> None:
        """Parse the dataset into corpus, queries, and ground truths."""
        ...

    def ingest(self) -> None:
        """Upload corpus to Files API and create a Vector Store."""
        ...

    def evaluate(self) -> dict:
        """Run queries and compute metrics."""
        ...
```

Register the new benchmark in `run_benchmark.py` and it becomes available via `--benchmark my_benchmark`.
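The registration step might look like the following. This is a hypothetical registry shape, since the actual wiring inside `run_benchmark.py` is not shown here; the stub base class stands in for `benchmarks.base.BenchmarkRunner` so the sketch is self-contained.

```python
class BenchmarkRunner:
    """Stand-in for benchmarks.base.BenchmarkRunner (illustrative only)."""
    name = ""

# Maps each benchmark's name to its class so --benchmark <name> can resolve it
BENCHMARKS: dict = {}

def register(cls):
    BENCHMARKS[cls.name] = cls
    return cls

@register
class MyBenchmark(BenchmarkRunner):
    name = "my_benchmark"
```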