NativeApi
Namespace: LLama.Native
Direct translation of the llama.cpp API
Inheritance Object → NativeApi
Methods
llama_sample_token_mirostat(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, Single, Int32, Single&)
Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
Parameters
candidates LLamaTokenDataArrayNative&
A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
tau Single
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
eta Single
The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.
m Int32
The number of tokens considered in the estimation of s_hat. This is an arbitrary value that is used to calculate s_hat, which in turn helps to calculate the value of k. In the paper, they use m = 100, but you can experiment with different values to see how it affects the performance of the algorithm.
mu Single&
Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.
Returns
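For illustration only (not part of the generated reference), a minimal C# sketch of calling this sampler, assuming `ctx` and `candidates` were already prepared by a decode step and earlier logit processing; the helper name and hyper-parameter values are hypothetical and simply mirror the paper's defaults:

```csharp
// Hedged sketch: assumes `using LLama.Native;` inside some class, and that
// llama_sample_token_mirostat returns the sampled LLamaToken.
static void SampleWithMirostatV1(SafeLLamaContextHandle ctx, ref LLamaTokenDataArrayNative candidates, ref float mu)
{
    const float tau = 5.0f;  // target surprise (cross-entropy)
    const float eta = 0.1f;  // learning rate for the mu update
    const int m = 100;       // tokens used to estimate s_hat, as in the paper

    // mu should be initialised to 2 * tau before the first call and carried between calls.
    var token = NativeApi.llama_sample_token_mirostat(ctx, ref candidates, tau, eta, m, ref mu);
    // ... feed `token` into the next llama_decode call
}
```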
llama_sample_token_mirostat_v2(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, Single, Single&)
Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
Parameters
candidates LLamaTokenDataArrayNative&
A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
tau Single
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
eta Single
The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.
mu Single&
Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.
Returns
llama_sample_token_greedy(SafeLLamaContextHandle, LLamaTokenDataArrayNative&)
Selects the token with the highest probability.
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
Returns
llama_sample_token(SafeLLamaContextHandle, LLamaTokenDataArrayNative&)
Randomly selects a token from the candidates based on their probabilities.
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
Returns
<llama_get_embeddings>g__llama_get_embeddings_native|30_0(SafeLLamaContextHandle)
Parameters
Returns
<llama_token_to_piece>g__llama_token_to_piece_native|44_0(SafeLlamaModelHandle, LLamaToken, Byte*, Int32)
Parameters
model SafeLlamaModelHandle
llamaToken LLamaToken
buffer Byte*
length Int32
Returns
<TryLoadLibraries>g__TryLoad|84_0(String)
Parameters
path String
Returns
<TryLoadLibraries>g__TryFindPath|84_1(String, <>c__DisplayClass84_0&)
Parameters
filename String
Returns
llama_set_n_threads(SafeLLamaContextHandle, UInt32, UInt32)
Set the number of threads used for decoding
Parameters
n_threads UInt32
The number of threads used for generation (single token)
n_threads_batch UInt32
The number of threads used for prompt and batch processing (multiple tokens)
llama_vocab_type(SafeLlamaModelHandle)
Parameters
model SafeLlamaModelHandle
Returns
llama_rope_type(SafeLlamaModelHandle)
Parameters
model SafeLlamaModelHandle
Returns
llama_grammar_init(LLamaGrammarElement**, UInt64, UInt64)
Create a new grammar from the given set of grammar rules
Parameters
rules LLamaGrammarElement**
n_rules UInt64
start_rule_index UInt64
Returns
llama_grammar_free(IntPtr)
Free all memory from the given SafeLLamaGrammarHandle
Parameters
grammar IntPtr
llama_grammar_copy(SafeLLamaGrammarHandle)
Create a copy of an existing grammar instance
Parameters
grammar SafeLLamaGrammarHandle
Returns
llama_sample_grammar(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, SafeLLamaGrammarHandle)
Apply constraints from grammar
Parameters
candidates LLamaTokenDataArrayNative&
grammar SafeLLamaGrammarHandle
llama_grammar_accept_token(SafeLLamaContextHandle, SafeLLamaGrammarHandle, LLamaToken)
Accepts the sampled token into the grammar
Parameters
grammar SafeLLamaGrammarHandle
token LLamaToken
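A hedged usage sketch (not from the upstream docs) showing how the grammar sampling calls fit together, assuming `ctx`, `candidates` and a previously created SafeLLamaGrammarHandle already exist; the helper name is hypothetical:

```csharp
// Hedged sketch: assumes `using LLama.Native;` and that llama_sample_token returns an LLamaToken.
static void SampleWithGrammar(SafeLLamaContextHandle ctx, ref LLamaTokenDataArrayNative candidates, SafeLLamaGrammarHandle grammar)
{
    // Mask out candidates the grammar cannot accept at this position.
    NativeApi.llama_sample_grammar(ctx, ref candidates, grammar);

    // Pick a token from the remaining candidates, then advance the grammar state with it.
    var token = NativeApi.llama_sample_token(ctx, ref candidates);
    NativeApi.llama_grammar_accept_token(ctx, grammar, token);
}
```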
llava_validate_embed_size(SafeLLamaContextHandle, SafeLlavaModelHandle)
Sanity check for clip <-> llava embed size match
Parameters
ctxLlama SafeLLamaContextHandle
LLama Context
ctxClip SafeLlavaModelHandle
Llava Model
Returns
Boolean
True if validation succeeds (the embed sizes match)
llava_image_embed_make_with_bytes(SafeLlavaModelHandle, Int32, Byte[], Int32)
Build an image embed from image file bytes
Parameters
ctx_clip SafeLlavaModelHandle
SafeHandle to the Clip Model
n_threads Int32
Number of threads
image_bytes Byte[]
Binary image in jpeg format
image_bytes_length Int32
Length of the image data in bytes
Returns
SafeLlavaImageEmbedHandle
SafeHandle to the Embeddings
llava_image_embed_make_with_filename(SafeLlavaModelHandle, Int32, String)
Build an image embed from a path to an image file
Parameters
ctx_clip SafeLlavaModelHandle
SafeHandle to the Clip Model
n_threads Int32
Number of threads
image_path String
Path to the image file (jpeg) to generate embeddings from
Returns
SafeLlavaImageEmbedHandle
SafeHandle to the embeddings
llava_image_embed_free(IntPtr)
Free an embedding made with llava_image_embed_make_*
Parameters
embed IntPtr
Embeddings to release
llava_eval_image_embed(SafeLLamaContextHandle, SafeLlavaImageEmbedHandle, Int32, Int32&)
Write the image represented by embed into the llama context with batch size n_batch, starting at context pos n_past. On completion, n_past points to the next position in the context after the image embed.
Parameters
ctx_llama SafeLLamaContextHandle
Llama Context
embed SafeLlavaImageEmbedHandle
Embedding handle
n_batch Int32
n_past Int32&
Returns
Boolean
True on success
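Taken together, the llava entry points above can be used roughly as in the following hedged sketch; the helper name, thread count and error handling are illustrative, not prescribed by the API:

```csharp
// Hedged sketch: assumes `using System;` and `using LLama.Native;`.
static void EvalImage(SafeLLamaContextHandle ctxLlama, SafeLlavaModelHandle ctxClip, string imagePath, int nBatch, ref int nPast)
{
    if (!NativeApi.llava_validate_embed_size(ctxLlama, ctxClip))
        throw new InvalidOperationException("clip and llava embed sizes do not match");

    // Build the image embedding, then write it into the context starting at nPast.
    var embed = NativeApi.llava_image_embed_make_with_filename(ctxClip, Environment.ProcessorCount, imagePath);
    if (!NativeApi.llava_eval_image_embed(ctxLlama, embed, nBatch, ref nPast))
        throw new InvalidOperationException("failed to evaluate the image embedding");

    // nPast now points at the first position after the image embedding.
}
```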
llama_model_quantize(String, String, LLamaModelQuantizeParams*)
Returns 0 on success
Parameters
fname_inp String
fname_out String
param LLamaModelQuantizeParams*
Returns
UInt32
Returns 0 on success
llama_sample_repetition_penalties(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, LLamaToken*, UInt64, Single, Single, Single)
Repetition penalty described in CTRL academic paper https://arxiv.org/abs/1909.05858, with negative logit fix. Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
last_tokens LLamaToken*
last_tokens_size UInt64
penalty_repeat Single
Repetition penalty described in CTRL academic paper https://arxiv.org/abs/1909.05858, with negative logit fix.
penalty_freq Single
Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
penalty_present Single
Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
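A hedged C# sketch of applying these penalties before sampling, assuming unsafe code is enabled and a candidates array already exists; the penalty values are illustrative:

```csharp
// Hedged sketch: assumes `using LLama.Native;` and <AllowUnsafeBlocks>true</AllowUnsafeBlocks>.
static unsafe void ApplyPenalties(SafeLLamaContextHandle ctx, ref LLamaTokenDataArrayNative candidates, LLamaToken[] lastTokens)
{
    fixed (LLamaToken* lastPtr = lastTokens)
    {
        // Repeat penalty 1.1; frequency and presence penalties disabled (0).
        NativeApi.llama_sample_repetition_penalties(ctx, ref candidates, lastPtr, (ulong)lastTokens.Length, 1.1f, 0.0f, 0.0f);
    }
}
```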
llama_sample_apply_guidance(SafeLLamaContextHandle, Span<Single>, ReadOnlySpan<Single>, Single)
Apply classifier-free guidance to the logits as described in academic paper "Stay on topic with Classifier-Free Guidance" https://arxiv.org/abs/2306.17806
Parameters
logits Span<Single>
Logits extracted from the original generation context.
logits_guidance ReadOnlySpan<Single>
Logits extracted from a separate context from the same model.
Other than a negative prompt at the beginning, it should have all generated and user input tokens copied from the main context.
scale Single
Guidance strength. 1.0f means no guidance. Higher values mean stronger guidance.
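An illustrative sketch of this span overload, assuming the logits for both contexts have already been copied into managed arrays; the scale value is just an example:

```csharp
// Hedged sketch: assumes `using LLama.Native;`. Arrays convert implicitly to Span/ReadOnlySpan.
static void ApplyGuidance(SafeLLamaContextHandle ctx, float[] logits, float[] guidanceLogits)
{
    // scale = 1.0f leaves the logits unchanged; larger values push harder towards the guided context.
    NativeApi.llama_sample_apply_guidance(ctx, logits, guidanceLogits, 1.5f);
}
```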
llama_sample_apply_guidance(SafeLLamaContextHandle, Single*, Single*, Single)
Apply classifier-free guidance to the logits as described in academic paper "Stay on topic with Classifier-Free Guidance" https://arxiv.org/abs/2306.17806
Parameters
logits Single*
Logits extracted from the original generation context.
logits_guidance Single*
Logits extracted from a separate context from the same model.
Other than a negative prompt at the beginning, it should have all generated and user input tokens copied from the main context.
scale Single
Guidance strength. 1.0f means no guidance. Higher values mean stronger guidance.
llama_sample_softmax(SafeLLamaContextHandle, LLamaTokenDataArrayNative&)
Sorts candidate tokens by their logits in descending order and calculates probabilities based on the logits.
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
llama_sample_top_k(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Int32, UInt64)
Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
k Int32
min_keep UInt64
llama_sample_top_p(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)
Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
p Single
min_keep UInt64
llama_sample_min_p(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)
Minimum P sampling as described in https://github.com/ggerganov/llama.cpp/pull/3841
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
p Single
min_keep UInt64
llama_sample_tail_free(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)
Tail Free Sampling described in https://www.trentonbricken.com/Tail-Free-Sampling/.
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
z Single
min_keep UInt64
llama_sample_typical(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)
Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
p Single
min_keep UInt64
llama_sample_entropy(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, Single, Single)
Dynamic temperature implementation described in the paper https://arxiv.org/abs/2309.02772.
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
min_temp Single
max_temp Single
exponent_val Single
llama_sample_temp(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single)
Modify logits by temperature
Parameters
candidates LLamaTokenDataArrayNative&
temp Single
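The individual samplers above are usually chained; a hedged sketch of one common ordering (the cut-offs and temperature are illustrative, and the helper name is hypothetical):

```csharp
// Hedged sketch: assumes `using LLama.Native;` and that llama_sample_token returns an LLamaToken.
static void SampleNextToken(SafeLLamaContextHandle ctx, ref LLamaTokenDataArrayNative candidates)
{
    NativeApi.llama_sample_top_k(ctx, ref candidates, 40, 1);      // keep the 40 most likely tokens
    NativeApi.llama_sample_top_p(ctx, ref candidates, 0.95f, 1);   // nucleus cut-off
    NativeApi.llama_sample_temp(ctx, ref candidates, 0.8f);        // temperature
    var token = NativeApi.llama_sample_token(ctx, ref candidates); // draw from what is left
    // ... feed `token` into the next llama_decode call
}
```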
llama_get_embeddings(SafeLLamaContextHandle)
Get the embeddings for the input
Parameters
Returns
llama_chat_apply_template(SafeLlamaModelHandle, Char*, LLamaChatMessage*, IntPtr, Boolean, Char*, Int32)
Apply chat template. Inspired by hf apply_chat_template() in Python. Both "model" and "custom_template" are optional, but at least one is required; "custom_template" has higher precedence than "model". NOTE: This function does not use a jinja parser. It only supports a pre-defined list of templates. See more: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
Parameters
model SafeLlamaModelHandle
tmpl Char*
A Jinja template to use for this chat. If this is nullptr, the model’s default chat template will be used instead.
chat LLamaChatMessage*
Pointer to a list of multiple llama_chat_message
n_msg IntPtr
Number of llama_chat_message in this chat
add_ass Boolean
Whether to end the prompt with the token(s) that indicate the start of an assistant message.
buf Char*
A buffer to hold the output formatted prompt. The recommended alloc size is 2 * (total number of characters of all messages)
length Int32
The size of the allocated buffer
Returns
Int32
The total number of bytes of the formatted prompt. If it is larger than the size of the buffer, you may need to re-alloc it and then re-apply the template.
llama_token_bos(SafeLlamaModelHandle)
Get the "Beginning of sentence" token
Parameters
model SafeLlamaModelHandle
Returns
llama_token_eos(SafeLlamaModelHandle)
Get the "End of sentence" token
Parameters
model SafeLlamaModelHandle
Returns
llama_token_nl(SafeLlamaModelHandle)
Get the "new line" token
Parameters
model SafeLlamaModelHandle
Returns
llama_add_bos_token(SafeLlamaModelHandle)
Returns -1 if unknown, 1 for true or 0 for false.
Parameters
model SafeLlamaModelHandle
Returns
llama_add_eos_token(SafeLlamaModelHandle)
Returns -1 if unknown, 1 for true or 0 for false.
Parameters
model SafeLlamaModelHandle
Returns
llama_token_prefix(SafeLlamaModelHandle)
Codellama infill tokens: beginning of infill prefix
Parameters
model SafeLlamaModelHandle
Returns
llama_token_middle(SafeLlamaModelHandle)
Codellama infill tokens: beginning of infill middle
Parameters
model SafeLlamaModelHandle
Returns
llama_token_suffix(SafeLlamaModelHandle)
Codellama infill tokens: beginning of infill suffix
Parameters
model SafeLlamaModelHandle
Returns
llama_token_eot(SafeLlamaModelHandle)
Codellama infill tokens: end of infill middle
Parameters
model SafeLlamaModelHandle
Returns
llama_print_timings(SafeLLamaContextHandle)
Print out timing information for this context
Parameters
llama_reset_timings(SafeLLamaContextHandle)
Reset all collected timing information for this context
Parameters
llama_print_system_info()
Print system information
Returns
llama_token_to_piece(SafeLlamaModelHandle, LLamaToken, Span<Byte>)
Convert a single token into text
Parameters
model SafeLlamaModelHandle
llamaToken LLamaToken
buffer Span<Byte>
buffer to write string into
Returns
Int32
The length written, or if the buffer is too small a negative that indicates the length required
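A hedged sketch of the "retry with a larger buffer" pattern this return value suggests; the helper name and initial buffer size are illustrative:

```csharp
// Hedged sketch: assumes `using System;`, `using System.Text;` and `using LLama.Native;`.
static string TokenToText(SafeLlamaModelHandle model, LLamaToken token)
{
    Span<byte> buffer = stackalloc byte[32];
    int written = NativeApi.llama_token_to_piece(model, token, buffer);
    if (written < 0)
    {
        // A negative result is the required length; retry with a buffer that large.
        buffer = new byte[-written];
        written = NativeApi.llama_token_to_piece(model, token, buffer);
    }
    return Encoding.UTF8.GetString(buffer.Slice(0, written));
}
```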
llama_tokenize(SafeLlamaModelHandle, Byte*, Int32, LLamaToken*, Int32, Boolean, Boolean)
Convert text into tokens
Parameters
model SafeLlamaModelHandle
text Byte*
text_len Int32
tokens LLamaToken*
n_max_tokens Int32
add_bos Boolean
special Boolean
Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext. Does not insert a leading space.
Returns
Int32
Returns the number of tokens on success, no more than n_max_tokens.
Returns a negative number on failure - the number of tokens that would have been returned
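A hedged sketch of a tokenize call from managed code, assuming unsafe code is enabled; the buffer sizing heuristic and helper name are illustrative:

```csharp
// Hedged sketch: assumes `using System;`, `using System.Text;`, `using LLama.Native;` and unsafe code enabled.
static unsafe LLamaToken[] Tokenize(SafeLlamaModelHandle model, string text, bool addBos)
{
    var bytes = Encoding.UTF8.GetBytes(text);
    // One token per byte (plus an optional BOS) is generally a safe upper bound.
    var tokens = new LLamaToken[bytes.Length + (addBos ? 1 : 0)];

    int count;
    fixed (byte* textPtr = bytes)
    fixed (LLamaToken* tokenPtr = tokens)
    {
        count = NativeApi.llama_tokenize(model, textPtr, bytes.Length, tokenPtr, tokens.Length, addBos, false);
    }

    if (count < 0)
        throw new InvalidOperationException($"token buffer too small, {-count} tokens required");

    Array.Resize(ref tokens, count);
    return tokens;
}
```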
llama_log_set(LLamaLogCallback)
Register a callback to receive llama log messages
Parameters
logCallback LLamaLogCallback
llama_kv_cache_clear(SafeLLamaContextHandle)
Clear the KV cache
Parameters
llama_kv_cache_seq_rm(SafeLLamaContextHandle, LLamaSeqId, LLamaPos, LLamaPos)
Removes all tokens that belong to the specified sequence and have positions in [p0, p1)
Parameters
seq LLamaSeqId
p0 LLamaPos
p1 LLamaPos
llama_kv_cache_seq_cp(SafeLLamaContextHandle, LLamaSeqId, LLamaSeqId, LLamaPos, LLamaPos)
Copy all tokens that belong to the specified sequence to another sequence. Note that this does not allocate extra KV cache memory; it simply assigns the tokens to the new sequence.
Parameters
src LLamaSeqId
dest LLamaSeqId
p0 LLamaPos
p1 LLamaPos
llama_kv_cache_seq_keep(SafeLLamaContextHandle, LLamaSeqId)
Removes all tokens that do not belong to the specified sequence
Parameters
seq LLamaSeqId
llama_kv_cache_seq_add(SafeLLamaContextHandle, LLamaSeqId, LLamaPos, LLamaPos, Int32)
Adds relative position "delta" to all tokens that belong to the specified sequence and have positions in [p0, p1).
If the KV cache is RoPEd, the KV data is updated accordingly:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
Parameters
seq LLamaSeqId
p0 LLamaPos
p1 LLamaPos
delta Int32
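As an illustration of how llama_kv_cache_seq_rm and llama_kv_cache_seq_add can be combined for simple context-window sliding, a hedged sketch; it assumes, as in current LLamaSharp, that LLamaSeqId and LLamaPos convert implicitly from int, and the helper name is hypothetical:

```csharp
// Hedged sketch: assumes `using LLama.Native;` and implicit int conversions for LLamaSeqId/LLamaPos.
static void ShiftContext(SafeLLamaContextHandle ctx, int nKeep, int nPast, int nDiscard)
{
    LLamaSeqId seq = 0;

    // Drop the oldest nDiscard tokens after the first nKeep, then slide the remainder back.
    NativeApi.llama_kv_cache_seq_rm(ctx, seq, nKeep, nKeep + nDiscard);
    NativeApi.llama_kv_cache_seq_add(ctx, seq, nKeep + nDiscard, nPast, -nDiscard);
}
```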
llama_kv_cache_seq_div(SafeLLamaContextHandle, LLamaSeqId, LLamaPos, LLamaPos, Int32)
Integer division of the positions by factor of d > 1
If the KV cache is RoPEd, the KV data is updated accordingly:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
p0 < 0 : [0, p1]
p1 < 0 : [p0, inf)
Parameters
seq LLamaSeqId
p0 LLamaPos
p1 LLamaPos
d Int32
llama_kv_cache_seq_pos_max(SafeLLamaContextHandle, LLamaSeqId)
Returns the largest position present in the KV cache for the specified sequence
Parameters
seq LLamaSeqId
Returns
llama_kv_cache_defrag(SafeLLamaContextHandle)
Defragment the KV cache. This will be applied:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
Parameters
Returns
llama_kv_cache_update(SafeLLamaContextHandle)
Apply the KV cache updates (such as K-shifts, defragmentation, etc.)
Parameters
llama_batch_init(Int32, Int32, Int32)
Allocates a batch of tokens on the heap.
Each token can be assigned up to n_seq_max sequence ids.
The batch has to be freed with llama_batch_free().
If embd != 0, llama_batch.embd will be allocated with size of n_tokens * embd * sizeof(float).
Otherwise, llama_batch.token will be allocated to store n_tokens llama_token.
The rest of the llama_batch members are allocated with size n_tokens.
All members are left uninitialized.
Parameters
n_tokens Int32
embd Int32
n_seq_max Int32
Each token can be assigned up to n_seq_max sequence ids
Returns
llama_batch_free(LLamaNativeBatch)
Frees a batch of tokens allocated with llama_batch_init()
Parameters
batch LLamaNativeBatch
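A hedged sketch of the allocate/use/free lifecycle these two functions imply; the sizes are illustrative:

```csharp
// Hedged sketch: assumes `using LLama.Native;`.
static void BatchLifetime()
{
    // Room for up to 512 tokens, token IDs rather than embeddings (embd == 0), one sequence id per token.
    var batch = NativeApi.llama_batch_init(512, 0, 1);
    try
    {
        // ... fill in the batch members (they start uninitialised) and pass it to llama_decode ...
    }
    finally
    {
        NativeApi.llama_batch_free(batch);
    }
}
```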
llama_decode(SafeLLamaContextHandle, LLamaNativeBatch)
Parameters
batch LLamaNativeBatch
Returns
Int32
A positive return value does not indicate a fatal error, but rather a warning:
- 0: success
- 1: could not find a KV slot for the batch (try reducing the size of the batch or increasing the context)
- < 0: error
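A hedged sketch of handling these return codes; the helper name and error handling are illustrative:

```csharp
// Hedged sketch: assumes `using System;` and `using LLama.Native;`.
static void Decode(SafeLLamaContextHandle ctx, LLamaNativeBatch batch)
{
    int status = NativeApi.llama_decode(ctx, batch);
    if (status == 1)
    {
        // Warning: no KV slot was found; retry with a smaller batch or a larger context.
    }
    else if (status < 0)
    {
        throw new InvalidOperationException($"llama_decode failed with code {status}");
    }
}
```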
llama_kv_cache_view_init(SafeLLamaContextHandle, Int32)
Create an empty KV cache view. (use only for debugging purposes)
Parameters
n_max_seq Int32
Returns
llama_kv_cache_view_free(LLamaKvCacheView&)
Free a KV cache view. (use only for debugging purposes)
Parameters
view LLamaKvCacheView&
llama_kv_cache_view_update(SafeLLamaContextHandle, LLamaKvCacheView&)
Update the KV cache view structure with the current state of the KV cache. (use only for debugging purposes)
Parameters
view LLamaKvCacheView&
llama_get_kv_cache_token_count(SafeLLamaContextHandle)
Returns the number of tokens in the KV cache (slow, use only for debug) If a KV cell has multiple sequences assigned to it, it will be counted multiple times
Parameters
Returns
llama_get_kv_cache_used_cells(SafeLLamaContextHandle)
Returns the number of used KV cells (i.e. have at least one sequence assigned to them)
Parameters
Returns
llama_beam_search(SafeLLamaContextHandle, LLamaBeamSearchCallback, IntPtr, UInt64, Int32, Int32, Int32)
Deterministically returns the entire sentence constructed by a beam search.
Parameters
ctx SafeLLamaContextHandle
Pointer to the llama_context.
callback LLamaBeamSearchCallback
Invoked for each iteration of the beam_search loop, passing in beams_state.
callback_data IntPtr
A pointer that is simply passed back to callback.
n_beams UInt64
Number of beams to use.
n_past Int32
Number of tokens already evaluated.
n_predict Int32
Maximum number of tokens to predict. EOS may occur earlier.
n_threads Int32
Number of threads.
llama_empty_call()
A method that does nothing. This is a native method; calling it forces the llama native dependencies to be loaded.
llama_max_devices()
Get the maximum number of devices supported by llama.cpp
Returns
llama_model_default_params()
Create a LLamaModelParams with default values
Returns
llama_context_default_params()
Create a LLamaContextParams with default values
Returns
llama_model_quantize_default_params()
Create a LLamaModelQuantizeParams with default values
Returns
llama_supports_mmap()
Check if memory mapping is supported
Returns
llama_supports_mlock()
Check if memory locking is supported
Returns
llama_supports_gpu_offload()
Check if GPU offload is supported
Returns
llama_set_rng_seed(SafeLLamaContextHandle, UInt32)
Sets the current rng seed.
Parameters
seed UInt32
llama_get_state_size(SafeLLamaContextHandle)
Returns the maximum size in bytes of the state (rng, logits, embedding and kv_cache) - will often be smaller after compacting tokens
Parameters
Returns
llama_copy_state_data(SafeLLamaContextHandle, Byte*)
Copies the state to the specified destination address. Destination needs to have allocated enough memory.
Parameters
dest Byte*
Returns
UInt64
the number of bytes copied
llama_set_state_data(SafeLLamaContextHandle, Byte*)
Set the state reading from the specified address
Parameters
src Byte*
Returns
UInt64
the number of bytes read
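A hedged sketch of a save/restore round trip built on llama_get_state_size, llama_copy_state_data and llama_set_state_data, assuming unsafe code is enabled and that the size values are UInt64 (as in current LLamaSharp); the helper names are hypothetical:

```csharp
// Hedged sketch: assumes `using System;`, `using LLama.Native;` and unsafe code enabled.
static unsafe byte[] SaveState(SafeLLamaContextHandle ctx)
{
    // llama_get_state_size is an upper bound; llama_copy_state_data reports the bytes actually written.
    var buffer = new byte[checked((int)NativeApi.llama_get_state_size(ctx))];

    ulong written;
    fixed (byte* dst = buffer)
    {
        written = NativeApi.llama_copy_state_data(ctx, dst);
    }

    Array.Resize(ref buffer, checked((int)written));
    return buffer;
}

static unsafe void LoadState(SafeLLamaContextHandle ctx, byte[] state)
{
    fixed (byte* src = state)
    {
        NativeApi.llama_set_state_data(ctx, src);
    }
}
```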
llama_load_session_file(SafeLLamaContextHandle, String, LLamaToken[], UInt64, UInt64&)
Load session file
Parameters
path_session String
tokens_out LLamaToken[]
n_token_capacity UInt64
n_token_count_out UInt64&
Returns
llama_save_session_file(SafeLLamaContextHandle, String, LLamaToken[], UInt64)
Save session file
Parameters
path_session String
tokens LLamaToken[]
n_token_count UInt64
Returns
llama_token_get_text(SafeLlamaModelHandle, LLamaToken)
Parameters
model SafeLlamaModelHandle
token LLamaToken
Returns
llama_token_get_score(SafeLlamaModelHandle, LLamaToken)
Parameters
model SafeLlamaModelHandle
token LLamaToken
Returns
llama_token_get_type(SafeLlamaModelHandle, LLamaToken)
Parameters
model SafeLlamaModelHandle
token LLamaToken
Returns
llama_n_ctx(SafeLLamaContextHandle)
Get the size of the context window for the model for this context
Parameters
Returns
llama_n_batch(SafeLLamaContextHandle)
Get the batch size for this context
Parameters
Returns
llama_get_logits(SafeLLamaContextHandle)
Token logits obtained from the last call to llama_decode
The logits for the last token are stored in the last row
Can be mutated in order to change the probabilities of the next token.
Rows: n_tokens
Cols: n_vocab
Parameters
Returns
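A hedged sketch of reading one logit after a single-token decode, assuming llama_get_logits returns a raw Single* and unsafe code is enabled; the helper name is hypothetical:

```csharp
// Hedged sketch: assumes `using LLama.Native;` and unsafe code enabled.
static unsafe float ReadLogit(SafeLLamaContextHandle ctx, int vocabId)
{
    // After a single-token decode there is exactly one row of n_vocab logits,
    // so the logit for a given vocabulary id can be read by direct indexing.
    float* logits = NativeApi.llama_get_logits(ctx);
    return logits[vocabId];
}
```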
llama_get_logits_ith(SafeLLamaContextHandle, Int32)
Logits for the ith token. Equivalent to: llama_get_logits(ctx) + i*n_vocab
Parameters
i Int32
Returns
llama_get_embeddings_ith(SafeLLamaContextHandle, Int32)
Get the embeddings for the ith sequence. Equivalent to: llama_get_embeddings(ctx) + i*n_embd
Parameters
i Int32