

All functions documented on this page are safe to call from the main/UI thread; callbacks run on the main thread unless explicitly noted. The API surface is identical across iOS, macOS, Android, JVM, and Kotlin/Native — only the language and a handful of platform conventions differ.

ModelRunner

A ModelRunner represents a loaded model instance. Obtain one via:
  • Android (recommended): LeapModelDownloader.loadModel(...) / loadSimpleModel(...) — one-shot load that transparently routes through the optional Leap Model Service when installed, and adds WorkManager-backed background download staging on top.
  • iOS / macOS (recommended): ModelDownloader.loadModel(...) / loadSimpleModel(...) — one-shot load that routes file transfers through URLSession. Pass sessionConfiguration: .background(withIdentifier:) for downloads that survive app suspension. (The class ships in the LeapModelDownloader SPM library product.)
  • All platforms (iOS, Android, JVM, Linux native, Windows native, macOS Kotlin): LeapDownloader.loadModel(...) / loadSimpleModel(...) — the cross-platform manifest loader, with no platform-native background integration. Used directly on JVM/native and as the underlying loader inside both the iOS ModelDownloader and Android LeapModelDownloader.
Hold a strong reference for as long as you need to perform generations, then call unload() to release native resources. See Model Loading for full reference.
public protocol ModelRunner {
  func createConversation(systemPrompt: String?) -> Conversation
  func createConversationFromHistory(history: [ChatMessage]) -> Conversation
  func unload() async
  func getPromptTokensSize(messages: [ChatMessage], addBosToken: Bool) async -> Int
  var modelId: String { get }
}
getPromptTokensSize(messages:addBosToken:) returns the prompt token count for a hypothetical generation against messages — useful for context-budget checks before a request lands.
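As a sketch, a pre-flight budget check might look like this (contextLimit and reserveForReply are app-side assumptions, not SDK values; the runner call matches the protocol above):

```swift
// Hedged sketch: decide whether a request fits before sending it.
// `contextLimit` is an app-side constant you track for the loaded model,
// not a value the SDK exposes here.
func fitsInContext(
  _ runner: ModelRunner,
  messages: [ChatMessage],
  contextLimit: Int,
  reserveForReply: Int = 512
) async -> Bool {
  let promptTokens = await runner.getPromptTokensSize(
    messages: messages,
    addBosToken: true
  )
  // Leave headroom so the reply itself doesn't hit the context wall.
  return promptTokens + reserveForReply <= contextLimit
}
```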

Lifecycle

  • Use createConversation(systemPrompt:) for a fresh chat, or createConversationFromHistory(history:) to resume from persisted state.
  • Call unload() when you're done. On iOS this is async; on Kotlin it's a suspend function — both release native memory.
  • If the model runner is unloaded, any conversation it created becomes read-only.
Android lifecycle: If you need a model runner to survive activity destruction, wrap it in an Android Service. For most apps a ViewModel is sufficient — viewModelScope keeps the model alive across configuration changes, and unloading in onCleared() releases it on destruction.

Conversation

Conversation tracks chat state and exposes the streaming generation API. Instances are always created through a ModelRunner — don't construct one directly.
Conversation is a Kotlin interface bridged to Swift as a protocol — the get-only properties surface as { get } in Swift. The generation methods return a SKIE-bridged SkieSwiftFlow<MessageResponse> (iterable with for try await):
public protocol Conversation {
  var modelRunner: ModelRunner { get }
  var history: [ChatMessage] { get }
  var functions: [LeapFunction] { get }
  var isGenerating: Bool { get }

  func registerFunction(function: LeapFunction)
  func registerFunctions(functions: [LeapFunction])
  func appendToHistory(message: ChatMessage)
  func removeLastMessage()
  func exportToJSON() -> String

  func generateResponse(
    userTextMessage: String,
    generationOptions: GenerationOptions?
  ) -> SkieSwiftFlow<MessageResponse>

  func generateResponse(
    message: ChatMessage,
    generationOptions: GenerationOptions?
  ) -> SkieSwiftFlow<MessageResponse>
}
Kotlin parameter defaults don't propagate through Kotlin/Native, so the Swift method labels match the Kotlin parameter names (function:, functions:, message:) and generationOptions must be passed explicitly. A ConvenienceExtensions.swift overlay adds generateResponse(message:) without the options argument for the common case.
  • appendToHistory(message:) — record a message without triggering generation. Useful for replaying persisted state, or for inserting tool-result messages (role: .tool) after handling a function call.
  • removeLastMessage() — pop the trailing message. No-op on an empty history. Useful when a generation was cancelled and you want to drop the dangling user turn.
  • registerFunctions(functions:) — bulk-register tool definitions; equivalent to looping over registerFunction(function:).
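For example, the first two calls combine naturally after a handled function call or a cancelled generation. A hedged sketch, assuming ChatMessage exposes a role property and the (role:textContent:) initializer used elsewhere on this page:

```swift
// Record a tool result without triggering a generation.
let toolResult = ChatMessage(role: .tool, textContent: "{\"temp_c\": 21}")
conversation.appendToHistory(message: toolResult)

// After a cancelled generation, drop the dangling user turn if present.
if conversation.history.last?.role == .user {
  conversation.removeLastMessage()
}
```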

Properties

  • history — a snapshot copy of the chat messages. Mutating the snapshot doesn't affect generation. Once the stream emits Complete, history includes the final assistant reply.
  • isGenerating — true while a generation is in flight. Starting a second generation while one is already running is blocked.
  • functions — tool definitions the model may invoke. Registered through registerFunction(function:) / registerFunctions(functions:) on both platforms.

Streaming generation

The async stream is the recommended way to drive generation — both platforms emit the same MessageResponse cases in the same order. Cancel the consuming task / coroutine to stop generation cleanly.
let user = ChatMessage(role: .user, textContent: "Hello! What can you do?")
let options = GenerationOptions()
  .with(temperature: 0.3)
  .with(minP: 0.15)
  .with(repetitionPenalty: 1.05)

Task {
  do {
    for try await response in conversation.generateResponse(
      message: user,
      generationOptions: options
    ) {
      switch onEnum(of: response) {
      case .chunk(let c):
        print(c.text, terminator: "")
      case .reasoningChunk(let r):
        print("Reasoning:", r.reasoning)
      case .functionCalls(let payload):
        handleFunctionCalls(payload.functionCalls)
      case .audioSample(let audio):
        // `audio.samples` is a `KotlinFloatArray` from Kotlin/Native — bridge to
        // `[Float]` via NSData if your renderer expects a Swift array:
        //   let nsData = LeapSDK.ArrayConversionsKt.floatArrayToNSData(array: audio.samples)
        //   let floats = nsData.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }
        audioRenderer.enqueue(audio.samples, sampleRate: Int(audio.sampleRate))
      case .complete(let completion):
        let text = completion.fullMessage.content.compactMap { part -> String? in
          if case let .text(t) = onEnum(of: part) { return t.text }
          return nil
        }.joined()
        print("\nComplete:", text)
        if let stats = completion.stats {
          print("Prompt tokens: \(stats.promptTokens), completion: \(stats.completionTokens)")
        }
      }
    }
  } catch {
    print("Generation failed: \(error)")
  }
}
onEnum(of:) (introduced in v0.10.0) gives exhaustive switching on Kotlin-bridged sealed types — the compiler errors if a new MessageResponse case is added.
Cancellation. Cancelling the Swift Task or the Kotlin coroutine Job stops generation and frees native resources. On both platforms cancellation is cooperative — the engine checks between tokens, so there's at most one extra token of slack after cancel().
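A minimal cancellation sketch, using the same generateResponse call as the example above (handle(_:) stands in for your per-case switch):

```swift
// Keep a handle to the consuming task so the UI can stop generation.
let generationTask = Task {
  for try await response in conversation.generateResponse(
    message: user,
    generationOptions: nil          // fall back to the model's defaults
  ) {
    handle(response)                // hypothetical per-case handler
  }
}

// Later, e.g. from a Stop button; the engine stops within about one token.
generationTask.cancel()
```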

Export chat history

Persisting, replaying, or shipping the conversation to a cloud fallback all boil down to serializing conversation.history. Swift exposes exportToJSON() (returns a JSON string in OpenAI chat-completions shape); Kotlin uses kotlinx.serialization (ChatMessage and ChatMessageContent are @Serializable).
let jsonString: String = conversation.exportToJSON()
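For instance, persisting the export to the app's documents directory (a hedged sketch; restoring it into [ChatMessage] for createConversationFromHistory(history:) is not shown in this section):

```swift
import Foundation

// Write the exported history to disk; error handling reduced to `try`.
let url = FileManager.default
  .urls(for: .documentDirectory, in: .userDomainMask)[0]
  .appendingPathComponent("chat-history.json")
try conversation.exportToJSON().write(to: url, atomically: true, encoding: .utf8)
```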

MessageResponse

A sealed type with one case per kind of incremental output the engine emits.
public enum MessageResponse {
  case chunk(Chunk)                        // Chunk.text — partial assistant text
  case reasoningChunk(ReasoningChunk)      // ReasoningChunk.reasoning — thinking tokens
  case functionCalls(FunctionCalls)        // FunctionCalls.functionCalls — [LeapFunctionCall]
  case audioSample(AudioSample)            // AudioSample.samples, .sampleRate — PCM frames
  case complete(Complete)                  // Complete.fullMessage, .finishReason, .stats
}
Each case wraps a small struct so SKIE can bridge Kotlin sealed classes losslessly. Use onEnum(of:) for exhaustive switching.
  • Chunk — partial assistant text. Append it to your UI buffer.
  • ReasoningChunk — thinking-style tokens emitted by reasoning models (wrapped between <think> / </think> upstream). Only fires when GenerationOptions.enableThinking = true and the model supports it.
  • FunctionCalls — one or more tool invocations the model wants you to execute. See Function Calling.
  • AudioSample — float32 mono PCM frames from audio-capable checkpoints. The sample rate is constant for a generation; route the frames to a renderer.
  • Complete — final marker. fullMessage is the assembled assistant ChatMessage (also present in conversation.history). stats is nullable (GenerationStats?); when present it holds promptTokens, completionTokens, totalTokens, tokenPerSecond (non-nullable Float), and cachedPromptTokens.

GenerationFinishReason

Complete.finishReason is one of:
Value            Meaning
STOP             The model emitted its EOS token — clean completion.
EXCEED_CONTEXT   The model hit the context-window limit before stopping. The reply may be truncated mid-sentence.
INTERRUPTED      Generation was cancelled by the caller (collector cancelled the flow / task).
CONSTRAINT       A constrained-generation constraint (e.g. JSON schema) forced an early stop.
ERROR            An internal error occurred. The partial fullMessage is not appended to history — your error handler should run instead.
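Inside a .complete handler you might branch on the reason. A hedged sketch: the Swift case spellings below assume SKIE's usual lowerCamelCase bridging of the Kotlin enum entries, and the two UI helpers are hypothetical; check your generated interface for the actual names:

```swift
// Assumes .stop / .exceedContext / .interrupted / .constraint / .error
// are the bridged Swift case names for GenerationFinishReason.
switch completion.finishReason {
case .stop:
  break                              // clean completion
case .exceedContext:
  showTruncatedBanner()              // hypothetical UI helper
case .interrupted:
  break                              // caller cancelled; usually nothing to do
case .constraint:
  break                              // schema forced an early stop
case .error:
  presentGenerationError()           // hypothetical UI helper
default:
  break
}
```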

GenerationOptions

Tune sampling, structured output, tool-call parsing, and reasoning behavior per request. Leave any field as null to fall back to the model bundle's defaults.
GenerationOptions is a Kotlin data class bridged into Swift. Kotlin parameter defaults don't survive the ObjC bridge, so the canonical Swift idiom is the parameterless init plus chained .with(...) builders from ConvenienceExtensions.swift:
public class GenerationOptions {
  public var temperature: Float?
  public var topP: Float?
  public var minP: Float?
  public var repetitionPenalty: Float?
  public var topK: Int32?
  public var rngSeed: Int64?
  public var jsonSchemaConstraint: String?
  public var functionCallParser: LeapFunctionCallParser?
  public var injectSchemaIntoPrompt: Bool        // default true
  public var maxTokens: Int32?
  public var inlineThinkingTags: Bool            // default false
  public var enableThinking: Bool                // default false
  public var extras: String?

  public convenience init()                      // builder entry point

  // Builders (chainable):
  public func with(temperature: Float) -> GenerationOptions
  public func with(topP: Float) -> GenerationOptions
  public func with(minP: Float) -> GenerationOptions
  public func with(repetitionPenalty: Float) -> GenerationOptions
  public func with(topK: Int32) -> GenerationOptions
  public func with(rngSeed: Int64) -> GenerationOptions
  public func with(jsonSchema: String) -> GenerationOptions
  public func with(maxTokens: Int32) -> GenerationOptions
  public func with(injectSchemaIntoPrompt: Bool) -> GenerationOptions
  public func with(inlineThinkingTags: Bool) -> GenerationOptions
  public func with(enableThinking: Bool) -> GenerationOptions
}
For constrained generation, pass the schema string produced by the @Generatable macro into the JSON-schema builder:
let options = GenerationOptions()
    .with(temperature: 0.3)
    .with(minP: 0.15)
    .with(repetitionPenalty: 1.05)
    .with(maxTokens: 512)
    .with(jsonSchema: CityFact.jsonSchema())
The Apple-only GenerationOptionsCompat sibling type (used by legacy Leap.load(...) flows) additionally exposes setResponseFormat(jsonSchema: String) — see Constrained Generation.
  • Sampling fields (temperature, topP, minP, topK, repetitionPenalty) — standard sampling knobs. Use the values from the LEAP bundle manifest (sampling_parameters under generation_time_parameters in each model's <Quant>.json on LiquidAI/LeapBundles): they're tuned per checkpoint by the training team for the llama.cpp engine path the SDK runs, and differ from the HF model card defaults. Arbitrary "0.7" defaults from generic AI tutorials usually underperform.
  • rngSeed — set for deterministic / reproducible output (testing, debugging). Default is non-deterministic.
  • maxTokens — cap the response length. The model stops after this many completion tokens (prompt tokens don't count). Defaults to "until EOS or context limit." Useful for cost control with constrained output.
  • jsonSchemaConstraint — JSON Schema string for constrained generation. Prefer the higher-level helpers — Swift options.with(jsonSchema: T.jsonSchema()) (or GenerationOptionsCompat.setResponseFormat(jsonSchema:)) / Kotlin setResponseFormatType<T>() — with @Generatable types. See Constrained Generation.
  • injectSchemaIntoPrompt — when true (default), the schema is appended to the system message for semantic guidance in addition to the structural constraint at decode time. Set it to false to skip the prompt injection (matches llama-server grammar mode); this saves prompt tokens for large schemas.
  • functionCallParser — picks the tool-call parser matching the format the model emits. LFMFunctionCallParser (default) for Liquid Foundation Models; HermesFunctionCallParser() for Hermes/Qwen3 formats; null to receive raw tool-call text in Chunks.
  • enableThinking — turn on reasoning mode for models that support it (e.g. LFM2.5-Thinking). Reasoning tokens arrive as ReasoningChunks.
  • inlineThinkingTags — when true, thinking tokens are emitted as ordinary Chunks with the literal <think>...</think> tags intact (instead of as ReasoningChunks). ChatMessage.reasoningContent is still populated on the final message.
  • extras — backend-specific JSON payload (internal use).
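Tying a few of these toggles together, a reasoning-model request pinned for reproducible tests might look like this (builder names come from the listing above; the specific values are illustrative):

```swift
let options = GenerationOptions()
    .with(enableThinking: true)   // stream reasoning as ReasoningChunks
    .with(rngSeed: 1234)          // deterministic sampling for tests
    .with(maxTokens: 1024)        // hard cap on completion tokens
```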

GenerationStats

promptTokens         Long    Prompt tokens computed (excludes tokens restored from KV cache).
completionTokens     Long    Tokens emitted during generation.
totalTokens          Long    promptTokens + completionTokens (excludes cached tokens).
tokenPerSecond       Float   Generation throughput (may be approximate on some backends).
cachedPromptTokens   Long    Prompt tokens restored from KV cache — not recomputed. 0 when the
                             cache is disabled or missed.
cachedPromptTokens is useful for observing KV-cache effectiveness — a high ratio of cached tokens to total prompt tokens means the prefix matched and you skipped the prefill compute for those tokens.
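That ratio can be computed directly from the stats fields (a hypothetical helper, not an SDK API; per the table above, the full prompt is the computed tokens plus the cached ones):

```swift
// Fraction of the full prompt that was served from the KV cache.
// Full prompt = computed (promptTokens) + restored (cachedPromptTokens).
func kvCacheHitRatio(promptTokens: Int64, cachedPromptTokens: Int64) -> Double {
  let total = promptTokens + cachedPromptTokens
  guard total > 0 else { return 0 }
  return Double(cachedPromptTokens) / Double(total)
}
```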