

All functions documented on this page are safe to call from the main/UI thread; callbacks run on the main thread unless explicitly noted. The API surface is identical across iOS, macOS, Android, JVM, and Kotlin/Native — only the language and a handful of platform conventions differ.

ModelRunner

A ModelRunner represents a loaded model instance. Obtain one via:
  • Android (recommended): LeapModelDownloader.loadModel(...) / loadSimpleModel(...) — one-shot load that transparently routes through the optional Leap Model Service when installed, and adds WorkManager-backed background download staging on top.
  • iOS / macOS (recommended): ModelDownloader.loadModel(...) / loadSimpleModel(...) — one-shot load that routes file transfers through URLSession. Pass sessionConfiguration: .background(withIdentifier:) for downloads that survive app suspension. (The class ships in the LeapModelDownloader SPM library product.)
  • All platforms (iOS, Android, JVM, Linux native, Windows native, macOS Kotlin): LeapDownloader.loadModel(...) / loadSimpleModel(...) — the cross-platform manifest loader, with no platform-native background integration. Used directly on JVM/native and as the underlying loader inside both the iOS ModelDownloader and Android LeapModelDownloader.
Hold a strong reference for as long as you need to perform generations, then call unload() to release native resources. See Model Loading for full reference.
public protocol ModelRunner {
  func createConversation(systemPrompt: String?) -> Conversation
  func createConversationFromHistory(history: [ChatMessage]) -> Conversation
  func unload() async
  func getPromptTokensSize(messages: [ChatMessage], addBosToken: Bool) async -> Int
  var modelId: String { get }
}
getPromptTokensSize(messages:addBosToken:) returns the prompt token count for a hypothetical generation against messages — useful for context-budget checks before a request lands.
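As a sketch, a pre-flight budget check might look like this (contextLimit and reserveForReply are app-side assumptions, not SDK values; the runner call matches the protocol above):

```swift
// Hedged sketch: decide whether a request fits before sending it.
// `contextLimit` is an app-side constant you track for the loaded model,
// not a value the SDK exposes here.
func fitsInContext(
  _ runner: ModelRunner,
  messages: [ChatMessage],
  contextLimit: Int,
  reserveForReply: Int = 512
) async -> Bool {
  let promptTokens = await runner.getPromptTokensSize(
    messages: messages,
    addBosToken: true
  )
  // Leave headroom so the reply itself doesn't hit the context wall.
  return promptTokens + reserveForReply <= contextLimit
}
```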

Lifecycle

  • Use createConversation(systemPrompt:) for a fresh chat, or createConversationFromHistory(history:) to resume from persisted state.
  • Call unload() when you're done. On iOS this is async; on Kotlin it's a suspend function — both release native memory.
  • If the model runner is unloaded, any conversation it created becomes read-only.
Android lifecycle: If you need a model runner to survive activity destruction, wrap it in an Android Service. For most apps a ViewModel is sufficient — viewModelScope keeps the model alive across configuration changes, and unloading in onCleared() releases it on destruction.

Conversation

Conversation tracks chat state and exposes the streaming generation API. Instances are always created through a ModelRunner — don't construct one directly.
Conversation is a Kotlin interface bridged to Swift as a protocol — the get-only properties surface as { get } in Swift. The generation methods return a SKIE-bridged SkieSwiftFlow<MessageResponse> (iterable with for try await):
public protocol Conversation {
  var modelRunner: ModelRunner { get }
  var history: [ChatMessage] { get }
  var functions: [LeapFunction] { get }
  var isGenerating: Bool { get }

  func registerFunction(function: LeapFunction)
  func registerFunctions(functions: [LeapFunction])
  func appendToHistory(message: ChatMessage)
  func removeLastMessage()
  func exportToJSON() -> String

  func generateResponse(
    userTextMessage: String,
    generationOptions: GenerationOptions?
  ) -> SkieSwiftFlow<MessageResponse>

  func generateResponse(
    message: ChatMessage,
    generationOptions: GenerationOptions?
  ) -> SkieSwiftFlow<MessageResponse>
}
Kotlin parameter defaults don't propagate through Kotlin/Native, so the Swift method labels match the Kotlin parameter names (function:, functions:, message:) and generationOptions must be passed explicitly. A ConvenienceExtensions.swift overlay adds generateResponse(message:) without the options argument for the common case.
  • appendToHistory(message:) — record a message without triggering generation. Useful for replaying persisted state, or for inserting tool-result messages (role: .tool) after handling a function call.
  • removeLastMessage() — pop the trailing message. No-op on an empty history. Useful when a generation was cancelled and you want to drop the dangling user turn.
  • registerFunctions(functions:) — bulk-register tool definitions; equivalent to looping over registerFunction(function:).
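For example, the first two calls combine naturally after a handled function call or a cancelled generation. A hedged sketch, assuming ChatMessage exposes a role property and the (role:textContent:) initializer used elsewhere on this page:

```swift
// Record a tool result without triggering a generation.
let toolResult = ChatMessage(role: .tool, textContent: "{\"temp_c\": 21}")
conversation.appendToHistory(message: toolResult)

// After a cancelled generation, drop the dangling user turn if present.
if conversation.history.last?.role == .user {
  conversation.removeLastMessage()
}
```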

Properties

  • history — a snapshot copy of the chat messages. Mutating the snapshot doesn't affect generation. Once the stream emits Complete, history includes the final assistant reply.
  • isGenerating — true while a generation is in flight. Starting a second generation while one is already running is blocked.
  • functions — tool definitions the model may invoke. Registered through registerFunction(function:) / registerFunctions(functions:) on both platforms.

Streaming generation

The async stream is the recommended way to drive generation — both platforms emit the same MessageResponse cases in the same order. Cancel the consuming task / coroutine to stop generation cleanly.
let user = ChatMessage(role: .user, textContent: "Hello! What can you do?")
let options = GenerationOptions()
  .with(temperature: 0.3)
  .with(minP: 0.15)
  .with(repetitionPenalty: 1.05)

Task {
  do {
    for try await response in conversation.generateResponse(
      message: user,
      generationOptions: options
    ) {
      switch onEnum(of: response) {
      case .chunk(let c):
        print(c.text, terminator: "")
      case .reasoningChunk(let r):
        print("Reasoning:", r.reasoning)
      case .functionCalls(let payload):
        handleFunctionCalls(payload.functionCalls)
      case .audioSample(let audio):
        // `audio.samples` is a `KotlinFloatArray` from Kotlin/Native — bridge to
        // `[Float]` via NSData if your renderer expects a Swift array:
        //   let nsData = LeapSDK.ArrayConversionsKt.floatArrayToNSData(array: audio.samples)
        //   let floats = nsData.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }
        audioRenderer.enqueue(audio.samples, sampleRate: Int(audio.sampleRate))
      case .complete(let completion):
        let text = completion.fullMessage.content.compactMap { part -> String? in
          if case let .text(t) = onEnum(of: part) { return t.text }
          return nil
        }.joined()
        print("\nComplete:", text)
        if let stats = completion.stats {
          print("Prompt tokens: \(stats.promptTokens), completion: \(stats.completionTokens)")
        }
      }
    }
  } catch {
    print("Generation failed: \(error)")
  }
}
onEnum(of:) (introduced in v0.10.0) gives exhaustive switching on Kotlin-bridged sealed types — the compiler errors if a new MessageResponse case is added.
Cancellation. Cancelling the Swift Task or the Kotlin coroutine Job stops generation and frees native resources. On both platforms cancellation is cooperative — the engine checks between tokens, so there's at most one extra token of slack after cancel().
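A minimal cancellation sketch, using the same generateResponse call as the example above (handle(_:) stands in for your per-case switch):

```swift
// Keep a handle to the consuming task so the UI can stop generation.
let generationTask = Task {
  for try await response in conversation.generateResponse(
    message: user,
    generationOptions: nil          // fall back to the model's defaults
  ) {
    handle(response)                // hypothetical per-case handler
  }
}

// Later, e.g. from a Stop button; the engine stops within about one token.
generationTask.cancel()
```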

Export chat history

Persisting, replaying, or shipping the conversation to a cloud fallback all boil down to serializing conversation.history. Swift exposes exportToJSON() (returns a JSON string in OpenAI chat-completions shape); Kotlin uses kotlinx.serialization (ChatMessage and ChatMessageContent are @Serializable).
let jsonString: String = conversation.exportToJSON()
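For instance, persisting the export to the app's documents directory (a hedged sketch; restoring it into [ChatMessage] for createConversationFromHistory(history:) is not shown in this section):

```swift
import Foundation

// Write the exported history to disk; error handling reduced to `try`.
let url = FileManager.default
  .urls(for: .documentDirectory, in: .userDomainMask)[0]
  .appendingPathComponent("chat-history.json")
try conversation.exportToJSON().write(to: url, atomically: true, encoding: .utf8)
```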

MessageResponse

A sealed type with one case per kind of incremental output the engine emits.
public enum MessageResponse {
  case chunk(Chunk)                        // Chunk.text — partial assistant text
  case reasoningChunk(ReasoningChunk)      // ReasoningChunk.reasoning — thinking tokens
  case functionCalls(FunctionCalls)        // FunctionCalls.functionCalls — [LeapFunctionCall]
  case audioSample(AudioSample)            // AudioSample.samples, .sampleRate — PCM frames
  case complete(Complete)                  // Complete.fullMessage, .finishReason, .stats
}
Each case wraps a small struct so SKIE can bridge Kotlin sealed classes losslessly. Use onEnum(of:) for exhaustive switching.
  • Chunk — partial assistant text. Append it to your UI buffer.
  • ReasoningChunk — thinking-style tokens emitted by reasoning models (wrapped between <think> / </think> upstream). Only fires when GenerationOptions.enableThinking = true and the model supports it.
  • FunctionCalls — one or more tool invocations the model wants you to execute. See Function Calling.
  • AudioSample — float32 mono PCM frames from audio-capable checkpoints. The sample rate is constant for a generation; route the frames to a renderer.
  • Complete — final marker. fullMessage is the assembled assistant ChatMessage (also present in conversation.history). stats is nullable (GenerationStats?); when present it holds promptTokens, completionTokens, totalTokens, tokenPerSecond (non-nullable Float), and cachedPromptTokens.

GenerationFinishReason

Complete.finishReason is one of:
Value            Meaning
STOP             The model emitted its EOS token — clean completion.
EXCEED_CONTEXT   The model hit the context-window limit before stopping. The reply may be truncated mid-sentence.
INTERRUPTED      Generation was cancelled by the caller (collector cancelled the flow / task).
CONSTRAINT       A constrained-generation constraint (e.g. JSON schema) forced an early stop.
ERROR            An internal error occurred. The partial fullMessage is not appended to history — your error handler should run instead.
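Inside a .complete handler you might branch on the reason. A hedged sketch: the Swift case spellings below assume SKIE's usual lowerCamelCase bridging of the Kotlin enum entries, and the two UI helpers are hypothetical; check your generated interface for the actual names:

```swift
// Assumes .stop / .exceedContext / .interrupted / .constraint / .error
// are the bridged Swift case names for GenerationFinishReason.
switch completion.finishReason {
case .stop:
  break                              // clean completion
case .exceedContext:
  showTruncatedBanner()              // hypothetical UI helper
case .interrupted:
  break                              // caller cancelled; usually nothing to do
case .constraint:
  break                              // schema forced an early stop
case .error:
  presentGenerationError()           // hypothetical UI helper
default:
  break
}
```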

GenerationOptions

Tune sampling, structured output, tool-call parsing, and reasoning behavior per request. Leave any field as null to fall back to the model bundle's defaults.
GenerationOptions is a Kotlin data class bridged into Swift. Kotlin parameter defaults don't survive the ObjC bridge, so the canonical Swift idiom is the parameterless init plus chained .with(...) builders from ConvenienceExtensions.swift:
public class GenerationOptions {
  public var temperature: Float?
  public var topP: Float?
  public var minP: Float?
  public var repetitionPenalty: Float?
  public var topK: Int32?
  public var rngSeed: Int64?
  public var jsonSchemaConstraint: String?
  public var functionCallParser: LeapFunctionCallParser?
  public var injectSchemaIntoPrompt: Bool        // default true
  public var maxTokens: Int32?
  public var inlineThinkingTags: Bool            // default false
  public var enableThinking: Bool                // default false
  public var extras: String?

  public convenience init()                      // builder entry point

  // Builders (chainable):
  public func with(temperature: Float) -> GenerationOptions
  public func with(topP: Float) -> GenerationOptions
  public func with(minP: Float) -> GenerationOptions
  public func with(repetitionPenalty: Float) -> GenerationOptions
  public func with(topK: Int32) -> GenerationOptions
  public func with(rngSeed: Int64) -> GenerationOptions
  public func with(jsonSchema: String) -> GenerationOptions
  public func with(maxTokens: Int32) -> GenerationOptions
  public func with(injectSchemaIntoPrompt: Bool) -> GenerationOptions
  public func with(inlineThinkingTags: Bool) -> GenerationOptions
  public func with(enableThinking: Bool) -> GenerationOptions
}
For constrained generation, pass the schema string produced by the @Generatable macro into the JSON-schema builder:
let options = GenerationOptions()
    .with(temperature: 0.3)
    .with(minP: 0.15)
    .with(repetitionPenalty: 1.05)
    .with(maxTokens: 512)
    .with(jsonSchema: CityFact.jsonSchema())
The Apple-only GenerationOptionsCompat sibling type (used by legacy Leap.load(...) flows) additionally exposes setResponseFormat(jsonSchema: String) — see Constrained Generation.
  • Sampling fields (temperature, topP, minP, topK, repetitionPenalty) — standard sampling knobs. Use the values from the LEAP bundle manifest (sampling_parameters under generation_time_parameters in each model's <Quant>.json on LiquidAI/LeapBundles): they're tuned per checkpoint by the training team for the llama.cpp engine path the SDK runs, and differ from the HF model card defaults. Arbitrary "0.7" defaults from generic AI tutorials usually underperform.
  • rngSeed — set for deterministic / reproducible output (testing, debugging). Default is non-deterministic.
  • maxTokens — cap the response length. The model stops after this many completion tokens (prompt tokens don't count). Defaults to "until EOS or context limit." Useful for cost control with constrained output.
  • jsonSchemaConstraint — JSON Schema string for constrained generation. Prefer the higher-level helpers — Swift options.with(jsonSchema: T.jsonSchema()) (or GenerationOptionsCompat.setResponseFormat(jsonSchema:)) / Kotlin setResponseFormatType<T>() — with @Generatable types. See Constrained Generation.
  • injectSchemaIntoPrompt — when true (default), the schema is appended to the system message for semantic guidance in addition to the structural constraint at decode time. Set it to false to skip the prompt injection (matches llama-server grammar mode); this saves prompt tokens for large schemas.
  • functionCallParser — picks the tool-call parser matching the format the model emits. LFMFunctionCallParser (default) for Liquid Foundation Models; HermesFunctionCallParser() for Hermes/Qwen3 formats; null to receive raw tool-call text in Chunks.
  • enableThinking — turn on reasoning mode for models that support it (e.g. LFM2.5-Thinking). Reasoning tokens arrive as ReasoningChunks.
  • inlineThinkingTags — when true, thinking tokens are emitted as ordinary Chunks with the literal <think>...</think> tags intact (instead of as ReasoningChunks). ChatMessage.reasoningContent is still populated on the final message.
  • extras — backend-specific JSON payload (internal use).
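Tying a few of these toggles together, a reasoning-model request pinned for reproducible tests might look like this (builder names come from the listing above; the specific values are illustrative):

```swift
let options = GenerationOptions()
    .with(enableThinking: true)   // stream reasoning as ReasoningChunks
    .with(rngSeed: 1234)          // deterministic sampling for tests
    .with(maxTokens: 1024)        // hard cap on completion tokens
```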

GenerationStats

promptTokens         Long    Prompt tokens computed (excludes tokens restored from KV cache).
completionTokens     Long    Tokens emitted during generation.
totalTokens          Long    promptTokens + completionTokens (excludes cached tokens).
tokenPerSecond       Float   Generation throughput (may be approximate on some backends).
cachedPromptTokens   Long    Prompt tokens restored from KV cache — not recomputed. 0 when the
                             cache is disabled or missed.
cachedPromptTokens is useful for observing KV-cache effectiveness — a high ratio of cached tokens to total prompt tokens means the prefix matched and you skipped the prefill compute for those tokens.
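That ratio can be computed directly from the stats fields (a hypothetical helper, not an SDK API; per the table above, the full prompt is the computed tokens plus the cached ones):

```swift
// Fraction of the full prompt that was served from the KV cache.
// Full prompt = computed (promptTokens) + restored (cachedPromptTokens).
func kvCacheHitRatio(promptTokens: Int64, cachedPromptTokens: Int64) -> Double {
  let total = promptTokens + cachedPromptTokens
  guard total > 0 else { return 0 }
  return Double(cachedPromptTokens) / Double(total)
}
```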