
iOS SDK

We offer both remote and on-device use of Llama Stack in Swift via a single SDK, llama-stack-client-swift, which contains two components:

  1. LlamaStackClient for remote
  2. LocalInference for on-device
(Image: Seamlessly switching between local, on-device inference and remote hosted inference)

Remote Only

If you don't want to run inference on-device, you can connect to any hosted Llama Stack distribution with the first component, LlamaStackClient.

  1. Add https://github.com/meta-llama/llama-stack-client-swift/ as a Package Dependency in Xcode

  2. Add LlamaStackClient as a framework to your app target

  3. Call an API:

import LlamaStackClient

let agents = RemoteAgents(url: URL(string: "http://localhost:8321")!)
let request = Components.Schemas.CreateAgentTurnRequest(
  agent_id: agentId,
  messages: [
    .UserMessage(Components.Schemas.UserMessage(
      content: .case1("Hello Llama!"),
      role: .user
    ))
  ],
  session_id: self.agenticSystemSessionId,
  stream: true
)

for try await chunk in try await agents.createTurn(request: request) {
  let payload = chunk.event.payload
  // handle each streamed agent event payload here
}

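Below is a minimal sketch (not part of the SDK) that wraps the streaming call above in a reusable helper. The function name, its parameters, and the stringified event trace are illustrative; agentId and sessionId are assumed to come from your own agent and session setup.

import LlamaStackClient

// Collects a readable trace of the streamed agent events for a single turn.
func collectTurnEvents(
  agents: RemoteAgents,
  agentId: String,
  sessionId: String,
  prompt: String
) async throws -> [String] {
  let request = Components.Schemas.CreateAgentTurnRequest(
    agent_id: agentId,
    messages: [
      .UserMessage(Components.Schemas.UserMessage(
        content: .case1(prompt),
        role: .user
      ))
    ],
    session_id: sessionId,
    stream: true
  )

  var events: [String] = []
  for try await chunk in try await agents.createTurn(request: request) {
    // A real app would switch over the concrete payload cases
    // (generated under Components.Schemas) instead of stringifying them.
    events.append(String(describing: chunk.event.payload))
  }
  return events
}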
Check out iOSCalendarAssistant for a complete app demo.

LocalInference

LocalInference provides a local inference implementation powered by executorch.

Llama Stack currently supports on-device inference for iOS, with Android support coming soon. (You can already run on-device inference on Android today directly with executorch, PyTorch's on-device inference library.)

The APIs work the same as remote – the only difference is you'll instead use the LocalAgents / LocalInference classes and pass in a DispatchQueue:

private let runnerQueue = DispatchQueue(label: "org.llamastack.stacksummary")
let inference = LocalInference(queue: runnerQueue)
let agents = LocalAgents(inference: self.inference)

Check out iOSCalendarAssistantWithLocalInf for a complete app demo.

Installation

We're working on making LocalInference easier to set up. For now, you'll need to import it via .xcframework:

  1. Clone the executorch submodule in this repo and its dependencies: git submodule update --init --recursive
  2. Install CMake for the executorch build
  3. Drag LocalInference.xcodeproj into your project
  4. Add LocalInference as a framework in your app target

Preparing a model

  1. Prepare a .pte file following the executorch docs
  2. Bundle the .pte and tokenizer.model files into your app (a quick runtime check is sketched below)
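The runtime check mentioned above can catch bundling mistakes early. This is a minimal sketch in plain Foundation (not part of the SDK); the file names match the loading example later on this page and are otherwise an assumption.

import Foundation

// Returns nil if either file is missing from the app bundle, which usually
// means it was added to the project but not to the target's
// "Copy Bundle Resources" build phase.
func bundledModelURLs() -> (model: URL, tokenizer: URL)? {
  guard
    let model = Bundle.main.url(forResource: "llama32_1b_spinquant", withExtension: "pte"),
    let tokenizer = Bundle.main.url(forResource: "tokenizer", withExtension: "model")
  else {
    return nil
  }
  return (model, tokenizer)
}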

We now support models quantized using SpinQuant and QAT+LoRA, which offer a significant performance boost (measured with the demo app on an iPhone 13 Pro):

| Llama 3.2 1B | Tokens / Second (total), Haiku | Tokens / Second (total), Paragraph | Time-to-First-Token (sec), Haiku | Time-to-First-Token (sec), Paragraph |
| --- | --- | --- | --- | --- |
| BF16 | 2.2 | 2.5 | 2.3 | 1.9 |
| QAT+LoRA | 7.1 | 3.3 | 0.37 | 0.24 |
| SpinQuant | 10.1 | 5.2 | 0.2 | 0.2 |

Using LocalInference

  1. Instantiate LocalInference with a DispatchQueue. Optionally, pass it into your agents service:

init() {
  runnerQueue = DispatchQueue(label: "org.meta.llamastack")
  inferenceService = LocalInferenceService(queue: runnerQueue)
  agentsService = LocalAgentsService(inference: inferenceService)
}
  2. Before making any inference calls, load your model from your bundle:

let mainBundle = Bundle.main
inferenceService.loadModel(
  modelPath: mainBundle.url(forResource: "llama32_1b_spinquant", withExtension: "pte"),
  tokenizerPath: mainBundle.url(forResource: "tokenizer", withExtension: "model"),
  completion: { _ in } // use to handle load failures
)
  3. Make inference calls (or agents calls) as you normally would with LlamaStack (a consolidated sketch combining these steps follows this list):

for await chunk in try await agentsService.initAndCreateTurn(
  messages: [
    .UserMessage(Components.Schemas.UserMessage(
      content: .case1("Call functions as needed to handle any actions in the following text:\n\n" + text),
      role: .user))
  ]
) {
  // handle each streamed agent event chunk here
}
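
Putting the three steps together, here is a consolidated sketch. Only LocalInferenceService, LocalAgentsService, loadModel, and initAndCreateTurn come from the steps above; the class name, the isModelLoaded flag, and the way the load completion is interpreted are illustrative assumptions.

import Foundation
import LlamaStackClient
// Also import the LocalInference framework you added in the Installation section.

final class LocalChatService {
  private let runnerQueue: DispatchQueue
  private let inferenceService: LocalInferenceService
  private let agentsService: LocalAgentsService
  private var isModelLoaded = false

  init() {
    runnerQueue = DispatchQueue(label: "org.meta.llamastack")
    inferenceService = LocalInferenceService(queue: runnerQueue)
    agentsService = LocalAgentsService(inference: inferenceService)
  }

  func loadModel() {
    let mainBundle = Bundle.main
    inferenceService.loadModel(
      modelPath: mainBundle.url(forResource: "llama32_1b_spinquant", withExtension: "pte"),
      tokenizerPath: mainBundle.url(forResource: "tokenizer", withExtension: "model"),
      completion: { [weak self] _ in
        // Assumption: treat the callback as "load attempt finished"; inspect
        // its argument in a real app to distinguish success from failure.
        self?.isModelLoaded = true
      }
    )
  }

  func handle(_ text: String) async throws {
    guard isModelLoaded else { return }
    for await chunk in try await agentsService.initAndCreateTurn(
      messages: [
        .UserMessage(Components.Schemas.UserMessage(
          content: .case1("Call functions as needed to handle any actions in the following text:\n\n" + text),
          role: .user))
      ]
    ) {
      // Handle the streamed agent events here, as in the remote example above.
      debugPrint(chunk)
    }
  }
}

Calling loadModel() once at startup and gating turns on isModelLoaded keeps inference calls from racing the (relatively slow) model load.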

Troubleshooting

If you receive errors like "missing package product" or "invalid checksum", try cleaning the build folder and resetting the Swift package cache:

(Opt+Click) Product > Clean Build Folder Immediately

rm -rf \
~/Library/org.swift.swiftpm \
~/Library/Caches/org.swift.swiftpm \
~/Library/Caches/com.apple.dt.Xcode \
~/Library/Developer/Xcode/DerivedData