# Evaluations
This guide walks you through the process of evaluating an LLM application built using Llama Stack. For detailed API reference, check out the Evaluation Reference guide that covers the complete set of APIs and developer experience flow.
Check out our Colab notebook for working examples with evaluations, or try the Getting Started notebook.
## Application Evaluation Example

Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.
In this example, we will show you how to:
- Build an Agent with Llama Stack
- Query the agent's sessions, turns, and steps to analyze execution
- Evaluate the results using scoring functions
## Step-by-Step Evaluation Process

### 1. Building a Search Agent

First, let's create an agent that can search the web to answer questions:
```python
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

# Point the client at your running Llama Stack server
HOST = "localhost"  # adjust to your deployment
PORT = 8321  # default Llama Stack server port

client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")

agent = Agent(
    client,
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant. Use the search tool to answer the questions.",
    tools=["builtin::websearch"],
)

# Test prompts for evaluation
user_prompts = [
    "Which teams played in the NBA Western Conference Finals of 2024. Search the web for the answer.",
    "In which episode and season of South Park does Bill Cosby (BSM-471) first appear? Give me the number and title. Search the web for the answer.",
    "What is the British-American kickboxer Andrew Tate's kickboxing name? Search the web for the answer.",
]

session_id = agent.create_session("test-session")

# Execute all prompts in the session
for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )

    for log in AgentEventLogger().log(response):
        log.print()
```
### 2. Query Agent Execution Steps

Now, let's analyze the agent's execution steps to understand its performance:
- Session Analysis
- Tool Usage Validation
```python
from rich.pretty import pprint

# Query the agent's session to get detailed execution data
session_response = client.agents.session.retrieve(
    session_id=session_id,
    agent_id=agent.agent_id,
)
pprint(session_response)

# Sanity check: Verify that all user prompts are followed by tool calls
num_tool_call = 0
for turn in session_response.turns:
    for step in turn.steps:
        if (
            step.step_type == "tool_execution"
            and step.tool_calls[0].tool_name == "brave_search"
        ):
            num_tool_call += 1

print(
    f"{num_tool_call}/{len(session_response.turns)} user prompts are followed by a tool call to `brave_search`"
)
```
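Beyond the sanity check above, you can summarize what kinds of steps each turn executed. A minimal sketch that only uses the `turns`, `steps`, and `step_type` fields already shown above:

```python
from collections import Counter

# Count step types (e.g. inference, tool_execution) per turn to see how the agent worked
for i, turn in enumerate(session_response.turns):
    step_counts = Counter(step.step_type for step in turn.steps)
    print(f"Turn {i + 1}: {dict(step_counts)}")
```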
### 3. Evaluate Agent Responses

Now we'll evaluate the agent's responses using Llama Stack's scoring API:
- Data Preparation
- Scoring & Evaluation
```python
# Process agent execution history into evaluation rows
eval_rows = []

# Define expected answers for our test prompts
expected_answers = [
    "Dallas Mavericks and the Minnesota Timberwolves",
    "Season 4, Episode 12",
    "King Cobra",
]

# Create evaluation dataset from agent responses
for i, turn in enumerate(session_response.turns):
    eval_rows.append(
        {
            "input_query": turn.input_messages[0].content,
            "generated_answer": turn.output_message.content,
            "expected_answer": expected_answers[i],
        }
    )

pprint(eval_rows)

# Configure scoring parameters
scoring_params = {
    "basic::subset_of": None,  # Check if generated answer contains expected answer
}

# Run evaluation using Llama Stack's scoring API
scoring_response = client.scoring.score(
    input_rows=eval_rows, scoring_functions=scoring_params
)
pprint(scoring_response)

# Analyze results: `results` maps each scoring function ID to its per-row scores
subset_result = scoring_response.results["basic::subset_of"]
for i, score_row in enumerate(subset_result.score_rows):
    print(f"Query {i + 1}: {eval_rows[i]['input_query']}")
    print(f"  Generated: {eval_rows[i]['generated_answer'][:100]}...")
    print(f"  Expected: {expected_answers[i]}")
    print(f"  Score: {score_row['score']}")
    print()
```
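To get a single number for the whole run, you can aggregate the per-row scores yourself. A minimal sketch, assuming each entry in `score_rows` carries a numeric `score` field as in the output above:

```python
# Compute a simple accuracy over the per-row subset_of scores
rows = scoring_response.results["basic::subset_of"].score_rows
accuracy = sum(float(row["score"]) for row in rows) / len(rows)
print(f"basic::subset_of accuracy: {accuracy:.2%} over {len(rows)} rows")
```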
## Available Scoring Functions

Llama Stack provides several built-in scoring functions:

### Basic Scoring Functions

- `basic::subset_of`: Checks whether the expected answer is contained in the generated response
- `basic::exact_match`: Performs exact string matching between expected and generated answers
- `basic::regex_match`: Uses regular expressions to match patterns in responses
### Advanced Scoring Functions

- `llm_as_judge::accuracy`: Uses an LLM to judge response accuracy
- `llm_as_judge::helpfulness`: Evaluates how helpful the response is
- `llm_as_judge::safety`: Assesses response safety and appropriateness
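The LLM-as-judge functions are parameterized with the judge model to use and how to extract a score from the judge's output. The sketch below shows the general parameter shape; the prompt template, placeholder names, and score regex are assumptions you should adapt to your own rubric:

```python
# Hedged sketch: LLM-as-judge scoring parameters (adapt the template and regex to your rubric)
judge_scoring_params = {
    "llm_as_judge::accuracy": {
        "type": "llm_as_judge",
        "judge_model": "meta-llama/Llama-3.3-70B-Instruct",
        # Prompt shown to the judge; placeholders are filled from each evaluation row
        "prompt_template": (
            "Given the question {input_query}, the expected answer {expected_answer}, "
            "and the generated answer {generated_answer}, rate accuracy from 1 to 5. "
            "Respond with 'Answer: <score>'."
        ),
        # Regex used to pull the numeric score out of the judge's reply
        "judge_score_regexes": [r"Answer: (\d)"],
    },
}

judge_response = client.scoring.score(
    input_rows=eval_rows, scoring_functions=judge_scoring_params
)
```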
### Custom Scoring Functions

You can also create custom scoring functions for domain-specific evaluation needs.
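Before writing a custom function, you can check what is already registered on your running stack; custom functions are then registered against one of your distribution's scoring providers. A small sketch using the scoring-functions listing API:

```python
# See which scoring functions are registered on your Llama Stack deployment
for fn in client.scoring_functions.list():
    print(fn.identifier, "-", getattr(fn, "description", ""))
```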
## Evaluation Workflow Best Practices

### 🎯 Dataset Preparation
- Use diverse test cases that cover edge cases and common scenarios
- Include clear expected answers or success criteria
- Balance your dataset across different difficulty levels
### 📊 Metrics Selection
- Choose appropriate scoring functions for your use case
- Combine multiple metrics for comprehensive evaluation
- Consider both automated and human evaluation metrics
### 🔄 Iterative Improvement
- Run evaluations regularly during development
- Use evaluation results to identify areas for improvement
- Track performance changes over time
### 📈 Analysis & Reporting
- Analyze failures to understand model limitations
- Generate comprehensive evaluation reports (see the sketch after this list)
- Share results with stakeholders for informed decision-making
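As a starting point for reporting, the per-row scores can be flattened into a table and written out. Below is a minimal sketch that reuses `eval_rows` and `scoring_response` from step 3; pandas is an assumption here, not a Llama Stack dependency:

```python
import pandas as pd

# Flatten per-row scores into a table alongside the inputs and reference answers
report_rows = []
for fn_id, fn_result in scoring_response.results.items():
    for row, score_row in zip(eval_rows, fn_result.score_rows):
        report_rows.append(
            {
                "scoring_function": fn_id,
                "input_query": row["input_query"],
                "expected_answer": row["expected_answer"],
                "score": score_row.get("score"),
            }
        )

report = pd.DataFrame(report_rows)
print(report.groupby("scoring_function")["score"].mean())  # quick per-metric summary
report.to_csv("evaluation_report.csv", index=False)
```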
## Advanced Evaluation Scenarios

### Batch Evaluation

For evaluating large datasets efficiently:
```python
# Prepare a large evaluation dataset.
# `queries`, `generated_answers`, and `expected_answers` are your own lists:
# the prompts, your application's outputs for them, and the reference answers.
large_eval_dataset = [
    {"input_query": query, "generated_answer": generated, "expected_answer": answer}
    for query, generated, answer in zip(queries, generated_answers, expected_answers)
]

# Run batch evaluation with multiple scoring functions
batch_results = client.scoring.score(
    input_rows=large_eval_dataset,
    scoring_functions={
        "basic::subset_of": None,
        "llm_as_judge::accuracy": {
            "type": "llm_as_judge",
            "judge_model": "meta-llama/Llama-3.3-70B-Instruct",
        },
    },
)
```
### Multi-Metric Evaluation

Combining different scoring approaches:
```python
# Combine multiple scoring functions in a single scoring call.
# `scoring_functions` maps each scoring function ID to its (optional) parameters.
comprehensive_scoring = {
    "basic::exact_match": None,
    "basic::subset_of": None,
    "llm_as_judge::accuracy": {
        "type": "llm_as_judge",
        "judge_model": "meta-llama/Llama-3.3-70B-Instruct",
    },
    "llm_as_judge::safety": {
        "type": "llm_as_judge",
        "judge_model": "meta-llama/Llama-3.3-70B-Instruct",
    },
}

results = client.scoring.score(
    input_rows=eval_rows, scoring_functions=comprehensive_scoring
)
```
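Each entry in the returned `results` is keyed by scoring function ID, so every metric can be reported separately. A short sketch reusing `results` from the call above:

```python
# Summarize each scoring function's aggregated result
for fn_id, fn_result in results.results.items():
    print(f"{fn_id}: {fn_result.aggregated_results}")
```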
## Related Resources
- Agents - Building agents for evaluation
- Tools Integration - Using tools in evaluated agents
- Evaluation Reference - Complete API reference for evaluations
- Getting Started Notebook - Interactive examples
- Evaluation Examples - Additional evaluation scenarios