Inference

The Llama Stack Inference API generates completions, chat completions, and embeddings.

This API provides the raw interface to the underlying models. Three kinds of models are supported:

  • LLMs: these models generate "raw" and "chat" (conversational) completions.
  • Embedding models: these models generate embeddings to be used for semantic search.
  • Rerank models: these models reorder a set of documents by their relevance to a query.
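
To make the embedding and rerank roles above concrete, here is a minimal, self-contained sketch of how embeddings can drive semantic search. It does not call Llama Stack or any provider: the vectors in `TOY_EMBEDDINGS`, the document ids, and the helper names `cosine_similarity` and `rank_documents` are all hypothetical, chosen only for illustration. Real embeddings would come from an embedding model, and a rerank model would typically score query and document pairs with a learned model rather than with cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as having no similarity.
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Order document ids from most to least similar to the query vector."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical, hand-written vectors standing in for embedding model output.
TOY_EMBEDDINGS = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.7, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}

# Hypothetical query embedding pointing along the first dimension.
query = [1.0, 0.0, 0.0]

ranking = rank_documents(query, TOY_EMBEDDINGS)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

With these toy vectors, `doc_cats` ranks first and `doc_tax` last, because the query vector points along the dimension the first document weights most heavily.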

This section documents all available providers for the Inference API.