Version: v0.3.2

Telemetry

Llama Stack uses OpenTelemetry to provide tracing, metrics, and logging capabilities.

Automatic Metrics Generation

Llama Stack automatically generates metrics during inference operations. These metrics are aggregated at the inference request level and provide insights into token usage and model performance.

Available Metrics

The following metrics are automatically generated for each inference request:

Metric Name	Type	Unit	Description	Labels
`llama_stack_prompt_tokens_total`	Counter	`tokens`	Number of tokens in the input prompt	`model_id`, `provider_id`
`llama_stack_completion_tokens_total`	Counter	`tokens`	Number of tokens in the generated response	`model_id`, `provider_id`
`llama_stack_tokens_total`	Counter	`tokens`	Total tokens used (prompt + completion)	`model_id`, `provider_id`

Metric Generation Flow

1. Token Counting: During inference operations (chat completion, completion, etc.), the system counts tokens in both input prompts and generated responses.
2. Metric Construction: For each request, MetricEvent objects are created with the token counts.
3. Telemetry Logging: Metrics are sent to the configured telemetry sinks.
4. OpenTelemetry Export: When OpenTelemetry is enabled, metrics are exposed as standard OpenTelemetry counters.
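The first two steps of this flow can be sketched in a few lines of Python. This is an illustrative sketch, not the Llama Stack implementation: the `MetricEvent` dataclass is a hypothetical stand-in that mirrors the fields shown in this document, `build_token_metrics` is a made-up helper, and the metric names `prompt_tokens` and `completion_tokens` are assumed by analogy with `total_tokens`.

```python
import time
from dataclasses import dataclass, field


@dataclass
class MetricEvent:
    """Hypothetical stand-in with the fields shown in this document."""
    trace_id: str
    span_id: str
    metric: str
    value: int
    timestamp: float
    unit: str
    attributes: dict = field(default_factory=dict)


def build_token_metrics(prompt_tokens, completion_tokens, model_id, provider_id,
                        trace_id, span_id):
    """Create one MetricEvent per token count for a single inference request."""
    # Metric names other than "total_tokens" are assumptions for illustration.
    counts = {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
    now = time.time()
    return [
        MetricEvent(
            trace_id=trace_id,
            span_id=span_id,
            metric=name,
            value=value,
            timestamp=now,
            unit="tokens",
            attributes={"model_id": model_id, "provider_id": provider_id},
        )
        for name, value in counts.items()
    ]


events = build_token_metrics(
    prompt_tokens=100,
    completion_tokens=50,
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    provider_id="tgi",
    trace_id="1234567890abcdef",
    span_id="abcdef1234567890",
)
```

Each request yields its own set of events, which a real deployment would then hand to the configured sinks.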

Metric Aggregation Level

All metrics are generated and aggregated at the inference request level. This means:

Each individual inference request generates its own set of metrics
Metrics are not pre-aggregated across multiple requests
Aggregation (sums, averages, etc.) can be performed by your observability tools (Prometheus, Grafana, etc.)
Each metric includes labels for model_id and provider_id to enable filtering and grouping
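To illustrate what request-level granularity means for downstream tooling, the sketch below sums per-request token counts by label, which is roughly what an observability backend does when it groups by `model_id` or `provider_id`. The sample records, the model and provider names, and the `sum_by_label` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical per-request metric records, one per inference request.
events = [
    {"metric": "total_tokens", "value": 150,
     "attributes": {"model_id": "model-a", "provider_id": "tgi"}},
    {"metric": "total_tokens", "value": 90,
     "attributes": {"model_id": "model-a", "provider_id": "tgi"}},
    {"metric": "total_tokens", "value": 60,
     "attributes": {"model_id": "model-b", "provider_id": "ollama"}},
]


def sum_by_label(events, metric, label):
    """Sum the values of one metric, grouped by the given attribute label."""
    totals = defaultdict(int)
    for event in events:
        if event["metric"] == metric:
            totals[event["attributes"][label]] += event["value"]
    return dict(totals)


tokens_by_model = sum_by_label(events, "total_tokens", "model_id")
tokens_by_provider = sum_by_label(events, "total_tokens", "provider_id")
```

In practice this grouping would usually be done in Prometheus or Grafana rather than in application code.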

Example Metric Event

MetricEvent(
    trace_id="1234567890abcdef",
    span_id="abcdef1234567890",
    metric="total_tokens",
    value=150,
    timestamp=1703123456.789,
    unit="tokens",
    attributes={
        "model_id": "meta-llama/Llama-3.2-3B-Instruct",
        "provider_id": "tgi"
    },
)

Telemetry Sinks

Choose from multiple sink types based on your observability needs:

OpenTelemetry
Console

The OpenTelemetry sink sends events to an OpenTelemetry Collector for integration with observability platforms:

Use Cases:

Visualizing traces in tools like Jaeger
Collecting metrics for Prometheus
Integration with enterprise observability stacks

Features:

Standard OpenTelemetry format
Compatible with all OpenTelemetry collectors
Supports both traces and metrics

Configuration

Meta-Reference Provider

Currently, only the meta-reference provider is implemented. It can be configured to send events to multiple sink types:

telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      service_name: "llama-stack-service"
      sinks: ['console', 'otel_trace', 'otel_metric']
      otel_exporter_otlp_endpoint: "http://localhost:4318"

Environment Variables

Configure telemetry behavior using environment variables:

OTEL_EXPORTER_OTLP_ENDPOINT: OpenTelemetry Collector endpoint (default: http://localhost:4318)
OTEL_SERVICE_NAME: Service name for telemetry (default: empty string)
TELEMETRY_SINKS: Comma-separated list of sinks (default: [])
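As a rough illustration of how these variables and their defaults combine, the following sketch reads them from a mapping and parses the comma-separated sink list. The `load_telemetry_settings` helper and the returned dictionary keys are hypothetical; Llama Stack's own configuration loading may differ.

```python
import os


def load_telemetry_settings(env=None):
    """Resolve telemetry settings from environment variables with defaults."""
    env = os.environ if env is None else env
    raw_sinks = env.get("TELEMETRY_SINKS", "")
    return {
        "otel_exporter_otlp_endpoint": env.get(
            "OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318"
        ),
        "service_name": env.get("OTEL_SERVICE_NAME", ""),
        # Split on commas and drop empty entries, so an unset variable gives [].
        "sinks": [s.strip() for s in raw_sinks.split(",") if s.strip()],
    }


# Example with an explicit mapping instead of the real process environment.
settings = load_telemetry_settings({"TELEMETRY_SINKS": "console, otel_metric"})
defaults = load_telemetry_settings({})
```

Passing a plain dictionary keeps the example deterministic; calling `load_telemetry_settings()` with no argument reads the real process environment.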

Quick Setup: Complete Telemetry Stack

Use the automated setup script to launch the complete telemetry stack (Jaeger, OpenTelemetry Collector, Prometheus, and Grafana):

./scripts/telemetry/setup_telemetry.sh

This sets up:

Jaeger UI: http://localhost:16686 (traces visualization)
Prometheus: http://localhost:9090 (metrics)
Grafana: http://localhost:3000 (dashboards with auto-configured data sources)
OTEL Collector: http://localhost:4318 (OTLP endpoint)

Once the stack is running, you can visualize traces by opening Grafana and logging in with username admin and password admin.

Querying Metrics

When using the OpenTelemetry sink, metrics are exposed in standard format and can be queried through various tools:

Prometheus Queries
Grafana Dashboards
OpenTelemetry Collector

Example Prometheus queries for analyzing token usage:

# Total tokens used across all models
sum(llama_stack_tokens_total)

# Tokens per model
sum by (model_id) (llama_stack_tokens_total)

# Per-second rate of token consumption, averaged over the last 5 minutes
rate(llama_stack_tokens_total[5m])

# Token usage by provider
sum by (provider_id) (llama_stack_tokens_total)
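These queries can also be run programmatically through the Prometheus HTTP API's instant query endpoint (`/api/v1/query`). The sketch below assumes Prometheus is reachable at the quick-setup address `http://localhost:9090`; the helper names and the sample response are illustrative, and the network call is defined but not executed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # assumed: address from the quick setup


def instant_query_url(promql, base_url=PROMETHEUS_URL):
    """Build the URL for a Prometheus instant query."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def run_query(promql, base_url=PROMETHEUS_URL):
    """Send the query to Prometheus and return the decoded JSON response."""
    with urlopen(instant_query_url(promql, base_url)) as response:
        return json.load(response)


def values_by_label(response, label):
    """Map a label value to its sample value for an instant-vector result."""
    return {
        series["metric"].get(label, ""): float(series["value"][1])
        for series in response["data"]["result"]
    }


url = instant_query_url("sum by (model_id) (llama_stack_tokens_total)")

# Hand-written sample in the shape of a Prometheus instant-vector response.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"model_id": "model-a"}, "value": [1703123456.789, "240"]},
        ],
    },
}
tokens = values_by_label(sample_response, "model_id")
```

With a running stack, `values_by_label(run_query("sum by (model_id) (llama_stack_tokens_total)"), "model_id")` would return the live totals per model.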

Best Practices

🔍 Monitoring Strategy

Use OpenTelemetry for production environments
Set up alerts on key metrics like token usage and error rates

📊 Metrics Analysis

Track token usage trends to optimize costs
Monitor response times across different models
Analyze usage patterns to improve resource allocation

🚨 Alerting & Debugging

Set up alerts for unusual token consumption spikes
Use trace data to debug performance issues
Monitor error rates and failure patterns

🔧 Configuration Management

Use environment variables for flexible deployment
Ensure proper network access to OpenTelemetry collectors

Related Resources

Agents - Monitoring agent execution with telemetry
Evaluations - Using telemetry data for performance evaluation
Getting Started Notebook - Telemetry examples and queries
OpenTelemetry Documentation - Comprehensive observability framework
Jaeger Documentation - Distributed tracing visualization
