
Using Llama Stack as a Library

Set Up Llama Stack without a Server

If you plan to use an external service for Inference (even Ollama or TGI counts as external), it is often easier to use Llama Stack as a library, since this avoids the overhead of running a separate server process.

# install the core Llama Stack packages
uv pip install llama-stack llama-stack-client

# install the dependencies required by the "starter" distribution
llama stack list-deps starter | xargs -L1 uv pip install
import os

from llama_stack.core.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient(
    "starter",
    # provider_data is optional; use it to pass any provider-specific
    # data (such as API keys) that your providers need.
    provider_data={"tavily_search_api_key": os.environ["TAVILY_SEARCH_API_KEY"]},
)

This parses your config and sets up the inline implementations and remote clients your distribution needs.

Then you can access APIs such as models and inference on the client and call their methods directly:

response = client.models.list()
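
As a sketch of a direct API call, here is a minimal chat completion through the OpenAI-compatible endpoint the client exposes; the model identifier below is a placeholder and depends on which inference providers your distribution configures:

response = client.chat.completions.create(
    model="ollama/llama3.2:3b",  # placeholder model ID; use client.models.list() to see yours
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)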

If you've created a custom distribution, you can also point the client at its config.yaml file directly:

client = LlamaStackAsLibraryClient(config_path)
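
For example, with a placeholder path (substitute the location of your own run configuration):

config_path = "/path/to/my-distribution/run.yaml"  # hypothetical path
client = LlamaStackAsLibraryClient(config_path)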

Resource Management

When you're done using the client, you should properly release resources such as database connections. There are two ways to do this:

Using a Context Manager

The easiest and most Pythonic way is to use the client as a context manager, which automatically handles cleanup:

# Synchronous client
from llama_stack.core.library_client import LlamaStackAsLibraryClient

with LlamaStackAsLibraryClient("starter") as client:
    response = client.models.list()
# Client is automatically shut down here

For the async client:

from llama_stack.core.library_client import AsyncLlamaStackAsLibraryClient

async with AsyncLlamaStackAsLibraryClient("starter") as client:
    response = await client.models.list()
# Client is automatically shut down here
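
The async snippets above use top-level await, which only works in environments like Jupyter or the asyncio REPL. In a regular script, wrap the block in a coroutine and drive it with asyncio.run; a minimal sketch:

import asyncio

from llama_stack.core.library_client import AsyncLlamaStackAsLibraryClient


async def main():
    # the context manager sets up the client on entry and shuts it down on exit
    async with AsyncLlamaStackAsLibraryClient("starter") as client:
        response = await client.models.list()
        print(response)


asyncio.run(main())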

Using Explicit shutdown()

Alternatively, you can manually call shutdown() when you're done:

# Synchronous client
client = LlamaStackAsLibraryClient("starter")
try:
    # ... use the client ...
    response = client.models.list()
finally:
    client.shutdown()

For the async client:

from llama_stack.core.library_client import AsyncLlamaStackAsLibraryClient

client = AsyncLlamaStackAsLibraryClient("starter")
await client.initialize()
try:
    # ... use the client ...
    response = await client.models.list()
finally:
    await client.shutdown()

The shutdown() method:

  • Closes all database connections (SQLite, PostgreSQL, etc.)
  • Releases any held resources
  • Can be called multiple times safely (idempotent)
Tip: If you don't call shutdown() or use a context manager, your program may hang on exit while waiting for background threads to complete, especially when using SQLite-based storage backends.