Evaluations

This guide walks you through evaluating an LLM application built with Llama Stack. For a detailed API reference, see the Evaluation Reference guide, which covers the complete set of APIs and the developer experience flow.

Interactive Examples

Check out our Colab notebook for working examples with evaluations, or try the Getting Started notebook.

Application Evaluation Example

Llama Stack offers a library of scoring functions and the /scoring API, allowing you to run evaluations on your pre-annotated AI application datasets.

In this example, we will show you how to:

  1. Build an Agent with Llama Stack
  2. Query the agent's sessions, turns, and steps to analyze execution
  3. Evaluate the results using scoring functions

Step-by-Step Evaluation Process

1. Building a Search Agent

First, let's create an agent that can search the web to answer questions:

from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

# HOST and PORT are the host and port of your running Llama Stack server
client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")

agent = Agent(
    client,
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant. Use search tool to answer the questions.",
    tools=["builtin::websearch"],
)

# Test prompts for evaluation
user_prompts = [
    "Which teams played in the NBA Western Conference Finals of 2024. Search the web for the answer.",
    "In which episode and season of South Park does Bill Cosby (BSM-471) first appear? Give me the number and title. Search the web for the answer.",
    "What is the British-American kickboxer Andrew Tate's kickboxing name? Search the web for the answer.",
]

session_id = agent.create_session("test-session")

# Execute all prompts in the session
for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )

    for log in AgentEventLogger().log(response):
        log.print()

2. Query Agent Execution Steps

Now, let's analyze the agent's execution steps to understand its performance:

from rich.pretty import pprint

# Query the agent's session to get detailed execution data
session_response = client.agents.session.retrieve(
    session_id=session_id,
    agent_id=agent.agent_id,
)

pprint(session_response)
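
Beyond pretty-printing the whole session, you can walk its turns and steps to see what the agent actually did. Here is a minimal sketch, assuming each turn exposes a steps list whose items carry a step_type field (attribute names may differ slightly across client versions):

# Summarize what the agent did in each turn
for turn in session_response.turns:
    print("Query:", turn.input_messages[0].content)
    for step in turn.steps:
        # step_type is typically "inference" or "tool_execution"
        print("  step:", step.step_type)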

3. Evaluate Agent Responses

Now we'll evaluate the agent's responses using Llama Stack's scoring API:

# Process agent execution history into evaluation rows
eval_rows = []

# Define expected answers for our test prompts
expected_answers = [
    "Dallas Mavericks and the Minnesota Timberwolves",
    "Season 4, Episode 12",
    "King Cobra",
]

# Create evaluation dataset from agent responses
for i, turn in enumerate(session_response.turns):
    eval_rows.append(
        {
            "input_query": turn.input_messages[0].content,
            "generated_answer": turn.output_message.content,
            "expected_answer": expected_answers[i],
        }
    )

pprint(eval_rows)
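
With the evaluation rows assembled, pass them to the scoring API. A minimal sketch using the parameter-free basic::subset_of scorer, with the same call shape as the batch example later in this guide:

# Score the agent's answers against the expected answers
scoring_response = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions={"basic::subset_of": None},
)

pprint(scoring_response)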

Available Scoring Functions

Llama Stack provides several built-in scoring functions:

Basic Scoring Functions

  • basic::subset_of: Checks if the expected answer is contained in the generated response
  • basic::exact_match: Performs exact string matching between expected and generated answers
  • basic::regex_match: Uses regular expressions to match patterns in responses

Advanced Scoring Functions

  • llm_as_judge::accuracy: Uses an LLM to judge response accuracy
  • llm_as_judge::helpfulness: Evaluates how helpful the response is
  • llm_as_judge::safety: Assesses response safety and appropriateness
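
The judge-based scorers take per-function parameters; at minimum you typically specify which model acts as the judge, as in the sketch below (the same parameter shape used in the batch example later in this guide):

# LLM-as-judge scorers are parameterized with the model that acts as judge
judged = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions={
        "llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
    },
)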

Custom Scoring Functions

You can also create custom scoring functions for domain-specific evaluation needs.
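
A custom scorer is typically registered through the scoring functions API before it can be referenced in a score call. The sketch below is illustrative only: the scoring function ID, provider ID, prompt, and parameter fields are assumptions and must be matched to the providers available in your distribution.

# Illustrative registration of a custom LLM-as-judge scorer; all identifiers
# and parameter names here are assumptions -- check your distribution's providers
client.scoring_functions.register(
    scoring_fn_id="llm_as_judge::my_domain_accuracy",  # hypothetical ID
    description="Judges domain-specific accuracy with a custom prompt",
    return_type={"type": "string"},
    provider_id="llm-as-judge",  # hypothetical provider ID
    params={
        "type": "llm_as_judge",
        "judge_model": "meta-llama/Llama-3.3-70B-Instruct",
        "prompt_template": "Rate the answer, then respond with 'Score: <0-10>'.",  # hypothetical prompt
        "judge_score_regexes": [r"Score: (\d+)"],
    },
)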

Evaluation Workflow Best Practices

🎯 Dataset Preparation

  • Use diverse test cases that cover edge cases and common scenarios
  • Include clear expected answers or success criteria
  • Balance your dataset across different difficulty levels

📊 Metrics Selection

  • Choose appropriate scoring functions for your use case
  • Combine multiple metrics for comprehensive evaluation
  • Consider both automated and human evaluation metrics

🔄 Iterative Improvement

  • Run evaluations regularly during development
  • Use evaluation results to identify areas for improvement
  • Track performance changes over time

📈 Analysis & Reporting

  • Analyze failures to understand model limitations
  • Generate comprehensive evaluation reports
  • Share results with stakeholders for informed decision-making

Advanced Evaluation Scenarios

Batch Evaluation

For evaluating large datasets efficiently:

# Prepare large evaluation dataset
# (queries and expected_answers are your parallel lists of test inputs and ground-truth answers)
large_eval_dataset = [
    {"input_query": query, "expected_answer": answer}
    for query, answer in zip(queries, expected_answers)
]

# Run batch evaluation
batch_results = client.scoring.score(
    input_rows=large_eval_dataset,
    scoring_functions={
        "basic::subset_of": None,
        "llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
    },
)
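
The response groups results per scoring function, each with per-row scores and aggregated metrics. A small sketch for reading the aggregates back (attribute names follow the current client models and may vary by version):

# Print the aggregated metrics reported by each scoring function
for fn_id, result in batch_results.results.items():
    print(fn_id, result.aggregated_results)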

Multi-Metric Evaluation

Combining different scoring approaches:

# Combine multiple scoring functions; keys are scoring function IDs,
# values are optional per-function parameters (as in the batch example above)
comprehensive_scoring = {
    "basic::exact_match": None,
    "basic::subset_of": None,
    "llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
    "llm_as_judge::safety": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
}

results = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions=comprehensive_scoring,
)
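
To see which individual answers failed a given check, inspect the per-row scores for that scoring function (again assuming a score_rows field on each result, per the current client models):

# Pair each per-row score with the query it was computed for
subset_scores = results.results["basic::subset_of"].score_rows
for row, score in zip(eval_rows, subset_scores):
    print(score, "-", row["input_query"])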