
RAG Benchmarks

Llama Stack implements an OpenAI-compatible API surface for RAG — Files, Vector Stores, and the Responses API with the file_search tool. This page presents benchmark results comparing Llama Stack's RAG quality against OpenAI SaaS, and describes the evaluation methodology.

Summary of Results

We evaluated Llama Stack against OpenAI across four benchmark suites covering retrieval accuracy, end-to-end answer quality, and multi-turn conversational RAG. Llama Stack was tested in two search modes: vector (pure semantic search) and hybrid (semantic + keyword search with reranking).

Retrieval Quality (BEIR)

| Dataset  | Corpus Size | Queries | OpenAI | Llama Stack (vector) | Llama Stack (hybrid) |
|----------|-------------|---------|--------|----------------------|----------------------|
| nfcorpus | 3,633       | 323     | 0.3156 | 0.3106               | 0.3350               |
| scifact  | 5,183       | 300     | 0.7165 | 0.6943               | 0.7137               |
| arguana  | 8,674       | 1,406   | 0.2960 | 0.3765               | 0.3835               |
| fiqa     | 57,638      | 648     | 0.2862 | 0.2399               | 0.2170               |

Metric: nDCG@10 (normalized Discounted Cumulative Gain at rank 10). Higher is better.
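
For intuition, nDCG@10 can be sketched in a few lines of Python. This is an illustrative computation only; the suite itself scores with pytrec_eval, as described under Methodology.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: graded relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the system's ranking divided by DCG of an ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Relevance grades of the returned documents, in rank order.
print(ndcg_at_k([3, 2, 1, 0]))  # 1.0: a perfect ranking
print(ndcg_at_k([0, 3, 2, 1]))  # < 1.0: the best document was ranked too low
```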

End-to-End RAG Quality

| Benchmark    | Type                       | OpenAI (F1) | Llama Stack vector (F1) | Llama Stack hybrid (F1) |
|--------------|----------------------------|-------------|-------------------------|-------------------------|
| MultiHOP RAG | Multi-hop reasoning        | 0.0114      | 0.0141                  | 0.0141                  |
| Doc2Dial     | Document-grounded dialogue | 0.1337      | 0.0962                  | 0.0966                  |

Metric: Token-level F1 (SQuAD-style). Higher is better. All end-to-end benchmarks used GPT-4.1 as the generation model.
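
Token-level F1 treats the prediction and the ground truth as bags of tokens. A minimal sketch (the full SQuAD scorer additionally strips punctuation and articles; this version only lowercases and splits on whitespace):

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """SQuAD-style token F1: harmonic mean of token precision and recall."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# One shared token out of six predicted: precision 1/6, recall 1, F1 = 2/7.
print(token_f1("the capital of France is Paris", "Paris"))  # 0.2857...
```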

Key Takeaways

  • Hybrid search matches or beats OpenAI on 3 of 4 BEIR datasets (nfcorpus, scifact, arguana), demonstrating that Llama Stack's combination of semantic search, keyword search, and reranking is competitive with OpenAI's proprietary retrieval.
  • Llama Stack outperforms OpenAI on arguana by roughly 30% (0.3835 vs 0.2960), a dataset that benefits from keyword matching on argumentative text.
  • fiqa is the one clear gap — OpenAI leads by 19% on this financial QA dataset (0.2862 vs 0.2399 in Llama Stack's stronger vector mode), likely due to differences in embedding models and chunking strategies for longer financial documents.
  • End-to-end RAG scores are low across both backends, reflecting the difficulty of these benchmarks (multi-hop reasoning, document-grounded dialogue) rather than a RAG-specific weakness. Both backends use the same LLM (GPT-4.1) for generation.

Methodology

API Surface Tested

The benchmark suite exercises four layers of the OpenAI-compatible API:

  1. Files API (POST /v1/files) — Upload documents as individual files
  2. Vector Stores API (POST /v1/vector_stores, POST /v1/vector_stores/{id}/files) — Create vector stores and attach files for automatic chunking and embedding
  3. Vector Stores Search API (POST /v1/vector_stores/{id}/search) — Direct retrieval evaluation (BEIR)
  4. Responses API (POST /v1/responses with file_search tool) — End-to-end RAG with automatic retrieval and generation (MultiHOP, Doc2Dial)

The same benchmark code runs against both OpenAI and Llama Stack — the only difference is the --base-url flag.
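
That symmetry can be sketched as data. The helper below is hypothetical (it is not part of the suite); the endpoint paths come from the list above, and vs_123 stands in for a server-assigned vector store ID:

```python
def rag_call_sequence(base_url: str, vector_store_id: str = "vs_123"):
    """Ordered API calls a benchmark run issues; only base_url changes per backend."""
    return [
        ("POST", f"{base_url}/files"),                                  # upload each document
        ("POST", f"{base_url}/vector_stores"),                          # create the store
        ("POST", f"{base_url}/vector_stores/{vector_store_id}/files"),  # attach files (chunk + embed)
        ("POST", f"{base_url}/vector_stores/{vector_store_id}/search"), # retrieval-only (BEIR)
        ("POST", f"{base_url}/responses"),                              # end-to-end with file_search
    ]

openai_calls = rag_call_sequence("https://api.openai.com/v1")
stack_calls = rag_call_sequence("http://localhost:8321/v1")
# Same endpoints either way; only the host differs.
assert [path.split("/v1")[1] for _, path in openai_calls] == \
       [path.split("/v1")[1] for _, path in stack_calls]
```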

Llama Stack Configuration

| Component        | Configuration                                          |
|------------------|--------------------------------------------------------|
| Embedding model  | nomic-ai/nomic-embed-text-v1.5 (sentence-transformers) |
| Reranker model   | Qwen/Qwen3-Reranker-0.6B (transformers)                |
| Vector database  | Milvus (standalone, remote)                            |
| Chunk size       | 512 tokens                                             |
| Chunk overlap    | 128 tokens                                             |
| Hybrid search    | RRF fusion (impact factor 60.0) with reranker          |
| Generation model | GPT-4.1 (via remote::openai provider)                  |
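
The "impact factor 60.0" likely corresponds to the k constant in the standard Reciprocal Rank Fusion formula, where each result list contributes 1/(k + rank) per document. A minimal sketch of the fusion step (illustrative only, not Llama Stack's implementation):

```python
def rrf_fuse(rankings, k=60.0):
    """Reciprocal Rank Fusion: sum 1/(k + rank) contributions across result lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d2"]   # from vector search
keyword = ["d1", "d4", "d3"]    # from keyword search
print(rrf_fuse([semantic, keyword]))  # ['d1', 'd3', 'd4', 'd2']
```

Documents that rank well in both lists (d1, d3) float to the top; the fused list is then passed to the reranker model.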

Benchmarks

BEIR (Retrieval-Only)

BEIR is a standard information retrieval benchmark. We evaluate on four datasets spanning biomedical, scientific, argumentative, and financial domains:

  • nfcorpus — Biomedical information retrieval (3,633 documents, 323 queries)
  • scifact — Scientific fact verification (5,183 documents, 300 queries)
  • arguana — Counterargument retrieval (8,674 documents, 1,406 queries)
  • fiqa — Financial opinion QA (57,638 documents, 648 queries)

BEIR benchmarks are retrieval-only — no LLM is involved. Documents are uploaded via the Files API, indexed in a Vector Store, and queries are evaluated using the Vector Stores Search API. Results are scored with pytrec_eval against BEIR's ground-truth relevance judgments.
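
pytrec_eval consumes plain nested dicts: qrels map each query to graded document judgments, and the run maps each query to retrieval scores. A minimal sketch with hypothetical IDs and scores:

```python
# Ground-truth relevance judgments (qrels): query -> doc -> graded relevance.
qrels = {"q1": {"d1": 2, "d2": 0, "d3": 1}}
# System output (run): query -> doc -> retrieval score from the search API.
run = {"q1": {"d1": 9.2, "d3": 7.5, "d4": 1.1}}

try:
    import pytrec_eval  # third-party; skip gracefully if not installed
    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10"})
    print(evaluator.evaluate(run)["q1"]["ndcg_cut_10"])
except ImportError:
    pass
```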

MultiHOP RAG (End-to-End)

MultiHOP RAG tests multi-hop reasoning over news articles. The system must retrieve evidence from multiple documents and synthesize an answer. Queries are sent to the Responses API with the file_search tool, and answers are evaluated with Exact Match, token-level F1, and ROUGE-L.

Doc2Dial (Document-Grounded Dialogue)

Doc2Dial evaluates document-grounded dialogue across government and social service domains. Each conversation is a multi-turn exchange between a user and an agent, grounded in a specific document. Conversations are threaded using previous_response_id to maintain context across turns, matching how production chat applications work.
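
A sketch of that threading, assuming the standard Responses API request shape (model, input, tools, previous_response_id); the response ID and vector store ID are placeholders:

```python
def turn_payload(user_message, vector_store_id, previous_response_id=None):
    """Build one Responses API request; chaining previous_response_id
    carries conversation context across turns."""
    payload = {
        "model": "gpt-4.1",  # the benchmark's generation model
        "input": user_message,
        "tools": [{"type": "file_search", "vector_store_ids": [vector_store_id]}],
    }
    if previous_response_id is not None:
        payload["previous_response_id"] = previous_response_id
    return payload

first = turn_payload("How do I renew my license?", "vs_123")
# The server's response carries an id (e.g. "resp_abc"); thread it into turn 2.
second = turn_payload("What documents do I need?", "vs_123",
                      previous_response_id="resp_abc")
assert "previous_response_id" not in first
assert second["previous_response_id"] == "resp_abc"
```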

Metrics

| Metric      | Used In            | Description                                                                 |
|-------------|--------------------|-----------------------------------------------------------------------------|
| nDCG@10     | BEIR               | Normalized Discounted Cumulative Gain at rank 10 — measures ranking quality |
| Recall@10   | BEIR               | Fraction of relevant documents retrieved in top 10                          |
| MAP@10      | BEIR               | Mean Average Precision at rank 10                                           |
| Exact Match | MultiHOP, Doc2Dial | Whether the prediction exactly matches the ground truth                     |
| F1          | MultiHOP, Doc2Dial | Token-level F1 (SQuAD-style precision/recall on answer tokens)              |
| ROUGE-L     | MultiHOP, Doc2Dial | Longest common subsequence overlap between prediction and ground truth      |

Running the Benchmarks

The benchmark suite lives in benchmarking/rag/. See the README for full setup instructions.

Quick Start

```bash
cd benchmarking/rag
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# Copy and configure environment
cp .env.example .env
# Edit .env with your OPENAI_API_KEY
```

Run Against OpenAI

```bash
python run_benchmark.py --benchmark beir --base-url https://api.openai.com/v1
python run_benchmark.py --benchmark multihop --base-url https://api.openai.com/v1
python run_benchmark.py --benchmark doc2dial --base-url https://api.openai.com/v1
```

Run Against Llama Stack

```bash
# Start Llama Stack with Milvus backend
bash start_stack.sh

# Vector search mode
python run_benchmark.py --benchmark beir --base-url http://localhost:8321/v1

# Hybrid search mode (recommended)
python run_benchmark.py --benchmark beir --base-url http://localhost:8321/v1 --search-mode hybrid
```

Compare Results

After running benchmarks against multiple backends:

```bash
python compare_results.py              # Table output
python compare_results.py --format csv # CSV for spreadsheets
```

Extending the Benchmarks

The suite is designed to be extended with new benchmarks. Each benchmark implements the BenchmarkRunner interface:

```python
from benchmarks.base import BenchmarkRunner

class MyBenchmark(BenchmarkRunner):
    name = "my_benchmark"

    def download(self) -> None:
        """Download or load the dataset."""
        ...

    def load_data(self) -> None:
        """Parse the dataset into corpus, queries, and ground truths."""
        ...

    def ingest(self) -> None:
        """Upload corpus to Files API and create a Vector Store."""
        ...

    def evaluate(self) -> dict:
        """Run queries and compute metrics."""
        ...
```

Register the new benchmark in run_benchmark.py and it will be available via --benchmark my_benchmark.
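
The registration mechanism itself isn't shown above; one common pattern is a name-to-class registry, sketched below with a stand-in BenchmarkRunner so the example is self-contained (the actual wiring in run_benchmark.py may differ):

```python
# Hypothetical sketch of a registry; BenchmarkRunner here stands in for
# benchmarks.base.BenchmarkRunner.
class BenchmarkRunner:
    name = "base"

class MyBenchmark(BenchmarkRunner):
    name = "my_benchmark"

# Name -> class mapping consulted when resolving --benchmark <name>.
REGISTRY = {cls.name: cls for cls in (MyBenchmark,)}

def get_runner(name: str) -> BenchmarkRunner:
    return REGISTRY[name]()

assert isinstance(get_runner("my_benchmark"), MyBenchmark)
```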