Evaluations

This guide walks you through evaluating an LLM application built with Llama Stack. For a detailed API reference, see the Evaluation Reference guide, which covers the complete set of APIs and the developer experience flow.

Interactive Examples

Check out our Colab notebook for working examples with evaluations, or try the Getting Started notebook.

Application Evaluation Example

Llama Stack offers a library of scoring functions and the /scoring API, allowing you to run evaluations on your pre-annotated AI application datasets.

In this example, we will show you how to:

  1. Build an Agent with Llama Stack
  2. Query the agent's sessions, turns, and steps to analyze execution
  3. Evaluate the results using scoring functions

Step-by-Step Evaluation Process

1. Building a Search Agent

First, let's create an agent that can search the web to answer questions:

from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

# HOST and PORT are the host and port of your running Llama Stack server
client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")

agent = Agent(
    client,
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant. Use search tool to answer the questions.",
    tools=["builtin::websearch"],
)

# Test prompts for evaluation
user_prompts = [
    "Which teams played in the NBA Western Conference Finals of 2024. Search the web for the answer.",
    "In which episode and season of South Park does Bill Cosby (BSM-471) first appear? Give me the number and title. Search the web for the answer.",
    "What is the British-American kickboxer Andrew Tate's kickboxing name? Search the web for the answer.",
]

session_id = agent.create_session("test-session")

# Execute all prompts in the session
for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )

    for log in AgentEventLogger().log(response):
        log.print()

2. Query Agent Execution Steps

Now, let's analyze the agent's execution steps to understand its performance:

from rich.pretty import pprint

# Query the agent's session to get detailed execution data
session_response = client.agents.session.retrieve(
    session_id=session_id,
    agent_id=agent.agent_id,
)

pprint(session_response)
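
Beyond pretty-printing the whole session, you can walk its turns and steps to see what the agent actually did. Here is a minimal sketch, assuming each turn exposes a steps list whose items carry a step_type field (attribute names may differ slightly across client versions):

# Summarize what the agent did in each turn
for turn in session_response.turns:
    print("Query:", turn.input_messages[0].content)
    for step in turn.steps:
        # step_type is typically "inference" or "tool_execution"
        print("  step:", step.step_type)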

3. Evaluate Agent Responses

Now we'll evaluate the agent's responses using Llama Stack's scoring API:

# Process agent execution history into evaluation rows
eval_rows = []

# Define expected answers for our test prompts
expected_answers = [
    "Dallas Mavericks and the Minnesota Timberwolves",
    "Season 4, Episode 12",
    "King Cobra",
]

# Create evaluation dataset from agent responses
for i, turn in enumerate(session_response.turns):
    eval_rows.append(
        {
            "input_query": turn.input_messages[0].content,
            "generated_answer": turn.output_message.content,
            "expected_answer": expected_answers[i],
        }
    )

pprint(eval_rows)
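
With the evaluation rows assembled, pass them to the scoring API. A minimal sketch using the parameter-free basic::subset_of scorer, with the same call shape as the batch example later in this guide:

# Score the agent's answers against the expected answers
scoring_response = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions={"basic::subset_of": None},
)

pprint(scoring_response)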

Available Scoring Functions

Llama Stack provides several built-in scoring functions:

Basic Scoring Functions

  • basic::subset_of: Checks if the expected answer is contained in the generated response
  • basic::exact_match: Performs exact string matching between expected and generated answers
  • basic::regex_match: Uses regular expressions to match patterns in responses

Advanced Scoring Functions

  • llm_as_judge::accuracy: Uses an LLM to judge response accuracy
  • llm_as_judge::helpfulness: Evaluates how helpful the response is
  • llm_as_judge::safety: Assesses response safety and appropriateness
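
The judge-based scorers take per-function parameters; at minimum you typically specify which model acts as the judge, as in the sketch below (the same parameter shape used in the batch example later in this guide):

# LLM-as-judge scorers are parameterized with the model that acts as judge
judged = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions={
        "llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
    },
)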

Custom Scoring Functions

You can also create custom scoring functions for domain-specific evaluation needs.
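
A custom scorer is typically registered through the scoring functions API before it can be referenced in a score call. The sketch below is illustrative only: the scoring function ID, provider ID, prompt, and parameter fields are assumptions and must be matched to the providers available in your distribution.

# Illustrative registration of a custom LLM-as-judge scorer; all identifiers
# and parameter names here are assumptions -- check your distribution's providers
client.scoring_functions.register(
    scoring_fn_id="llm_as_judge::my_domain_accuracy",  # hypothetical ID
    description="Judges domain-specific accuracy with a custom prompt",
    return_type={"type": "string"},
    provider_id="llm-as-judge",  # hypothetical provider ID
    params={
        "type": "llm_as_judge",
        "judge_model": "meta-llama/Llama-3.3-70B-Instruct",
        "prompt_template": "Rate the answer, then respond with 'Score: <0-10>'.",  # hypothetical prompt
        "judge_score_regexes": [r"Score: (\d+)"],
    },
)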

Evaluation Workflow Best Practices

🎯 Dataset Preparation

  • Use diverse test cases that cover edge cases and common scenarios
  • Include clear expected answers or success criteria
  • Balance your dataset across different difficulty levels

📊 Metrics Selection

  • Choose appropriate scoring functions for your use case
  • Combine multiple metrics for comprehensive evaluation
  • Consider both automated and human evaluation metrics

🔄 Iterative Improvement

  • Run evaluations regularly during development
  • Use evaluation results to identify areas for improvement
  • Track performance changes over time

📈 Analysis & Reporting

  • Analyze failures to understand model limitations
  • Generate comprehensive evaluation reports
  • Share results with stakeholders for informed decision-making

Advanced Evaluation Scenarios

Batch Evaluation

For evaluating large datasets efficiently:

# Prepare large evaluation dataset
# (queries and expected_answers are your parallel lists of test inputs and ground-truth answers)
large_eval_dataset = [
    {"input_query": query, "expected_answer": answer}
    for query, answer in zip(queries, expected_answers)
]

# Run batch evaluation
batch_results = client.scoring.score(
    input_rows=large_eval_dataset,
    scoring_functions={
        "basic::subset_of": None,
        "llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
    },
)
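
The response groups results per scoring function, each with per-row scores and aggregated metrics. A small sketch for reading the aggregates back (attribute names follow the current client models and may vary by version):

# Print the aggregated metrics reported by each scoring function
for fn_id, result in batch_results.results.items():
    print(fn_id, result.aggregated_results)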

Multi-Metric Evaluation

Combining different scoring approaches:

# Combine multiple scoring functions; keys are scoring function IDs,
# values are optional per-function parameters (as in the batch example above)
comprehensive_scoring = {
    "basic::exact_match": None,
    "basic::subset_of": None,
    "llm_as_judge::accuracy": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
    "llm_as_judge::safety": {"judge_model": "meta-llama/Llama-3.3-70B-Instruct"},
}

results = client.scoring.score(
    input_rows=eval_rows,
    scoring_functions=comprehensive_scoring,
)
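
To see which individual answers failed a given check, inspect the per-row scores for that scoring function (again assuming a score_rows field on each result, per the current client models):

# Pair each per-row score with the query it was computed for
subset_scores = results.results["basic::subset_of"].score_rows
for row, score in zip(eval_rows, subset_scores):
    print(score, "-", row["input_query"])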