NativeApi
Namespace: LLama.Native
Direct translation of the llama.cpp API
public static class NativeApi
Inheritance Object → NativeApi
Methods
llama_sample_token_mirostat(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, Single, Int32, Single&)
Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
public static LLamaToken llama_sample_token_mirostat(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float tau, float eta, int m, Single& mu)
Parameters
candidates
LLamaTokenDataArrayNative&
A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
tau
Single
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
eta
Single
The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.
m
Int32
The number of tokens considered in the estimation of s_hat. This is an arbitrary value that is used to calculate s_hat, which in turn helps to calculate the value of k. In the paper, the authors use m = 100, but you can experiment with different values to see how they affect the performance of the algorithm.
mu
Single&
Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.
Returns
llama_sample_token_mirostat_v2(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, Single, Single&)
Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
public static LLamaToken llama_sample_token_mirostat_v2(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float tau, float eta, Single& mu)
Parameters
candidates
LLamaTokenDataArrayNative&
A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
tau
Single
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
eta
Single
The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.
mu
Single&
Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.
Returns
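The interplay of tau, eta, and mu can be illustrated with a small sketch of the feedback step. This is not the native implementation. It assumes that surprise is measured in bits as -log2(p) and that mu is updated as mu - eta * (observed surprise - tau). The class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper for illustration; not part of LLama.Native.
public static class MirostatSketch
{
    // Surprise, in bits, of a token that was sampled with probability p.
    public static double Surprise(double p) => -Math.Log(p, 2.0);

    // One feedback step: nudges mu so that observed surprise drifts toward tau.
    public static float UpdateMu(float mu, float tau, float eta, float sampledProbability)
    {
        double error = Surprise(sampledProbability) - tau;
        return (float)(mu - eta * error);
    }
}
```

Starting from mu = 2 * tau = 10 with tau = 5 and eta = 0.1, a token sampled with probability 0.25 (2 bits of surprise) is less surprising than the target, so mu rises to about 10.3.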
llama_sample_token_greedy(SafeLLamaContextHandle, LLamaTokenDataArrayNative&)
Selects the token with the highest probability.
public static LLamaToken llama_sample_token_greedy(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
Returns
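Greedy selection amounts to an argmax over the candidates. The sketch below works on a plain array of logits rather than on LLamaTokenDataArrayNative, which is a simplification. Because softmax is monotonic, the largest logit is also the most probable token. GreedySketch is a hypothetical name.

```csharp
using System;

// Hypothetical helper for illustration; not part of LLama.Native.
public static class GreedySketch
{
    // Returns the index of the largest logit; ties resolve to the first occurrence.
    public static int ArgMax(ReadOnlySpan<float> logits)
    {
        if (logits.IsEmpty) throw new ArgumentException("No candidates.", nameof(logits));
        int best = 0;
        for (int i = 1; i < logits.Length; i++)
        {
            if (logits[i] > logits[best]) best = i;
        }
        return best;
    }
}
```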
llama_sample_token(SafeLLamaContextHandle, LLamaTokenDataArrayNative&)
Randomly selects a token from the candidates based on their probabilities.
public static LLamaToken llama_sample_token(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
Returns
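Sampling in proportion to probability can be sketched with a cumulative sum. The example assumes the probabilities are already normalized (for instance by a softmax step) and uses System.Random in place of the native RNG. SampleSketch is a hypothetical name.

```csharp
using System;

// Hypothetical helper for illustration; not part of LLama.Native.
public static class SampleSketch
{
    // Draws an index in proportion to the given probabilities (assumed to sum to about 1).
    public static int Sample(ReadOnlySpan<float> probabilities, Random rng)
    {
        if (probabilities.IsEmpty) throw new ArgumentException("No candidates.", nameof(probabilities));
        double r = rng.NextDouble(); // uniform in [0, 1)
        double cumulative = 0.0;
        for (int i = 0; i < probabilities.Length; i++)
        {
            cumulative += probabilities[i];
            if (r < cumulative) return i;
        }
        return probabilities.Length - 1; // guard against rounding error in the sum
    }
}
```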
<llama_get_embeddings>g__llama_get_embeddings_native|30_0(SafeLLamaContextHandle)
internal static Single* <llama_get_embeddings>g__llama_get_embeddings_native|30_0(SafeLLamaContextHandle ctx)
Parameters
Returns
<llama_token_to_piece>g__llama_token_to_piece_native|44_0(SafeLlamaModelHandle, LLamaToken, Byte*, Int32)
internal static int <llama_token_to_piece>g__llama_token_to_piece_native|44_0(SafeLlamaModelHandle model, LLamaToken llamaToken, Byte* buffer, int length)
Parameters
model
SafeLlamaModelHandle
llamaToken
LLamaToken
buffer
Byte*
length
Int32
Returns
<TryLoadLibraries>g__TryLoad|84_0(String)
internal static IntPtr <TryLoadLibraries>g__TryLoad|84_0(string path)
Parameters
path
String
Returns
<TryLoadLibraries>g__TryFindPath|84_1(String, <>c__DisplayClass84_0&)
internal static string <TryLoadLibraries>g__TryFindPath|84_1(string filename, <>c__DisplayClass84_0& )
Parameters
filename
String
Returns
llama_set_n_threads(SafeLLamaContextHandle, UInt32, UInt32)
Set the number of threads used for decoding
public static void llama_set_n_threads(SafeLLamaContextHandle ctx, uint n_threads, uint n_threads_batch)
Parameters
n_threads
UInt32
n_threads is the number of threads used for generation (single token)
n_threads_batch
UInt32
n_threads_batch is the number of threads used for prompt and batch processing (multiple tokens)
llama_vocab_type(SafeLlamaModelHandle)
public static LLamaVocabType llama_vocab_type(SafeLlamaModelHandle model)
Parameters
model
SafeLlamaModelHandle
Returns
llama_rope_type(SafeLlamaModelHandle)
public static LLamaRopeType llama_rope_type(SafeLlamaModelHandle model)
Parameters
model
SafeLlamaModelHandle
Returns
llama_grammar_init(LLamaGrammarElement**, UInt64, UInt64)
Create a new grammar from the given set of grammar rules
public static IntPtr llama_grammar_init(LLamaGrammarElement** rules, ulong n_rules, ulong start_rule_index)
Parameters
rules
LLamaGrammarElement**
n_rules
UInt64
start_rule_index
UInt64
Returns
llama_grammar_free(IntPtr)
Free all memory from the given SafeLLamaGrammarHandle
public static void llama_grammar_free(IntPtr grammar)
Parameters
grammar
IntPtr
llama_grammar_copy(SafeLLamaGrammarHandle)
Create a copy of an existing grammar instance
public static IntPtr llama_grammar_copy(SafeLLamaGrammarHandle grammar)
Parameters
grammar
SafeLLamaGrammarHandle
Returns
llama_sample_grammar(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, SafeLLamaGrammarHandle)
Apply constraints from grammar
public static void llama_sample_grammar(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, SafeLLamaGrammarHandle grammar)
Parameters
candidates
LLamaTokenDataArrayNative&
grammar
SafeLLamaGrammarHandle
llama_grammar_accept_token(SafeLLamaContextHandle, SafeLLamaGrammarHandle, LLamaToken)
Accepts the sampled token into the grammar
public static void llama_grammar_accept_token(SafeLLamaContextHandle ctx, SafeLLamaGrammarHandle grammar, LLamaToken token)
Parameters
grammar
SafeLLamaGrammarHandle
token
LLamaToken
llava_validate_embed_size(SafeLLamaContextHandle, SafeLlavaModelHandle)
Sanity check for clip <-> llava embed size match
public static bool llava_validate_embed_size(SafeLLamaContextHandle ctxLlama, SafeLlavaModelHandle ctxClip)
Parameters
ctxLlama
SafeLLamaContextHandle
LLama Context
ctxClip
SafeLlavaModelHandle
Llava Model
Returns
Boolean
True if validation succeeds
llava_image_embed_make_with_bytes(SafeLlavaModelHandle, Int32, Byte[], Int32)
Build an image embed from image file bytes
public static SafeLlavaImageEmbedHandle llava_image_embed_make_with_bytes(SafeLlavaModelHandle ctx_clip, int n_threads, Byte[] image_bytes, int image_bytes_length)
Parameters
ctx_clip
SafeLlavaModelHandle
SafeHandle to the Clip Model
n_threads
Int32
Number of threads
image_bytes
Byte[]
Binary image in jpeg format
image_bytes_length
Int32
Length of the image in bytes
Returns
SafeLlavaImageEmbedHandle
SafeHandle to the Embeddings
llava_image_embed_make_with_filename(SafeLlavaModelHandle, Int32, String)
Build an image embed from a path to an image filename
public static SafeLlavaImageEmbedHandle llava_image_embed_make_with_filename(SafeLlavaModelHandle ctx_clip, int n_threads, string image_path)
Parameters
ctx_clip
SafeLlavaModelHandle
SafeHandle to the Clip Model
n_threads
Int32
Number of threads
image_path
String
Image filename (jpeg) to generate embeddings
Returns
SafeLlavaImageEmbedHandle
SafeHandle to the embeddings
llava_image_embed_free(IntPtr)
Free an embedding made with llava_image_embed_make_*
public static void llava_image_embed_free(IntPtr embed)
Parameters
embed
IntPtr
Embeddings to release
llava_eval_image_embed(SafeLLamaContextHandle, SafeLlavaImageEmbedHandle, Int32, Int32&)
Write the image represented by embed into the llama context with batch size n_batch, starting at context position n_past. On completion, n_past points to the next position in the context after the image embed.
public static bool llava_eval_image_embed(SafeLLamaContextHandle ctx_llama, SafeLlavaImageEmbedHandle embed, int n_batch, Int32& n_past)
Parameters
ctx_llama
SafeLLamaContextHandle
Llama Context
embed
SafeLlavaImageEmbedHandle
Embedding handle
n_batch
Int32
n_past
Int32&
Returns
Boolean
True on success
llama_model_quantize(String, String, LLamaModelQuantizeParams*)
Returns 0 on success
public static uint llama_model_quantize(string fname_inp, string fname_out, LLamaModelQuantizeParams* param)
Parameters
fname_inp
String
fname_out
String
param
LLamaModelQuantizeParams*
Returns
UInt32
Returns 0 on success
llama_sample_repetition_penalties(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, LLamaToken*, UInt64, Single, Single, Single)
Repetition penalty described in CTRL academic paper https://arxiv.org/abs/1909.05858, with negative logit fix. Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
public static void llama_sample_repetition_penalties(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, LLamaToken* last_tokens, ulong last_tokens_size, float penalty_repeat, float penalty_freq, float penalty_present)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
last_tokens
LLamaToken*
last_tokens_size
UInt64
penalty_repeat
Single
Repetition penalty described in CTRL academic paper https://arxiv.org/abs/1909.05858, with negative logit fix.
penalty_freq
Single
Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
penalty_present
Single
Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
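How the three penalties combine can be sketched as follows. This is a simplified illustration, not the native code. It assumes the logits array is indexed by token id, that the "negative logit fix" means multiplying non-positive logits by the penalty instead of dividing, and that the frequency penalty scales with the occurrence count while the presence penalty is applied once per seen token. PenaltySketch is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of LLama.Native.
public static class PenaltySketch
{
    public static void Apply(float[] logits, IReadOnlyList<int> lastTokens,
                             float penaltyRepeat, float penaltyFreq, float penaltyPresent)
    {
        // Count how often each token occurs in the recent history.
        var counts = new Dictionary<int, int>();
        foreach (int t in lastTokens)
        {
            counts[t] = counts.TryGetValue(t, out int c) ? c + 1 : 1;
        }

        foreach (var pair in counts)
        {
            int token = pair.Key;
            int count = pair.Value;

            // Negative logit fix: dividing a negative logit would raise it,
            // so multiply instead to make sure the penalty lowers it.
            if (logits[token] <= 0f) logits[token] *= penaltyRepeat;
            else logits[token] /= penaltyRepeat;

            // Frequency penalty grows with the count; presence penalty is flat.
            logits[token] -= count * penaltyFreq + penaltyPresent;
        }
    }
}
```

With logits {2, -2, 1}, history {0, 1, 1}, penaltyRepeat = 2, penaltyFreq = 0.5, and penaltyPresent = 0.25, the result is {0.25, -5.25, 1}: token 2 never appeared, so it is untouched.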
llama_sample_apply_guidance(SafeLLamaContextHandle, Span<Single>, ReadOnlySpan<Single>, Single)
Apply classifier-free guidance to the logits as described in academic paper "Stay on topic with Classifier-Free Guidance" https://arxiv.org/abs/2306.17806
public static void llama_sample_apply_guidance(SafeLLamaContextHandle ctx, Span<float> logits, ReadOnlySpan<float> logits_guidance, float scale)
Parameters
logits
Span<Single>
Logits extracted from the original generation context.
logits_guidance
ReadOnlySpan<Single>
Logits extracted from a separate context from the same model.
Other than a negative prompt at the beginning, it should have all generated and user input tokens copied from the main context.
scale
Single
Guidance strength. 1.0f means no guidance. Higher values mean stronger guidance.
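The effect of scale can be seen in a sketch of the blending step. It assumes the guided logit is computed as guidance + scale * (logits - guidance), which reproduces the documented behavior that 1.0f leaves the logits unchanged. Any normalization the native function may perform beforehand is omitted here. GuidanceSketch is a hypothetical name.

```csharp
using System;

// Hypothetical helper for illustration; not part of LLama.Native.
public static class GuidanceSketch
{
    // Pushes each logit away from the guidance (negative prompt) logit by the given scale.
    public static void Blend(Span<float> logits, ReadOnlySpan<float> guidance, float scale)
    {
        if (logits.Length != guidance.Length)
            throw new ArgumentException("Both spans must cover the same vocabulary.");

        for (int i = 0; i < logits.Length; i++)
        {
            logits[i] = guidance[i] + scale * (logits[i] - guidance[i]);
        }
    }
}
```

For logits {1, 2} and guidance logits {0, 4}, a scale of 1 leaves {1, 2} unchanged, while a scale of 2 yields {2, 0}.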
llama_sample_apply_guidance(SafeLLamaContextHandle, Single*, Single*, Single)
Apply classifier-free guidance to the logits as described in academic paper "Stay on topic with Classifier-Free Guidance" https://arxiv.org/abs/2306.17806
public static void llama_sample_apply_guidance(SafeLLamaContextHandle ctx, Single* logits, Single* logits_guidance, float scale)
Parameters
logits
Single*
Logits extracted from the original generation context.
logits_guidance
Single*
Logits extracted from a separate context from the same model.
Other than a negative prompt at the beginning, it should have all generated and user input tokens copied from the main context.
scale
Single
Guidance strength. 1.0f means no guidance. Higher values mean stronger guidance.
llama_sample_softmax(SafeLLamaContextHandle, LLamaTokenDataArrayNative&)
Sorts candidate tokens by their logits in descending order and calculates probabilities based on the logits.
public static void llama_sample_softmax(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
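The sort-then-normalize behavior can be sketched on a plain array of logits. The example subtracts the maximum logit before exponentiating for numerical stability, which is a common choice and an assumption about the native code. SoftmaxSketch and its tuple layout are hypothetical.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of LLama.Native.
public static class SoftmaxSketch
{
    // Returns (token id, probability) pairs ordered by descending logit.
    public static (int Token, float P)[] SortedSoftmax(float[] logits)
    {
        if (logits.Length == 0) return Array.Empty<(int Token, float P)>();

        int[] order = Enumerable.Range(0, logits.Length)
                                .OrderByDescending(i => logits[i])
                                .ToArray();

        float max = logits[order[0]]; // subtracting the max avoids overflow in Exp
        double[] exps = order.Select(i => Math.Exp(logits[i] - max)).ToArray();
        double sum = exps.Sum();

        var result = new (int Token, float P)[order.Length];
        for (int rank = 0; rank < order.Length; rank++)
        {
            result[rank] = (order[rank], (float)(exps[rank] / sum));
        }
        return result;
    }
}
```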
llama_sample_top_k(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Int32, UInt64)
Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
public static void llama_sample_top_k(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, int k, ulong min_keep)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
k
Int32
min_keep
UInt64
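A minimal sketch of top-k filtering on a plain array of logits follows. It assumes min_keep acts as a lower bound on k, so that at least min_keep candidates survive, and that k is clamped to the number of candidates. TopKSketch is a hypothetical name.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of LLama.Native.
public static class TopKSketch
{
    // Keeps the k largest logits, but never fewer than minKeep (assumed semantics).
    public static float[] TopK(float[] logits, int k, int minKeep)
    {
        int keep = Math.Min(Math.Max(k, minKeep), logits.Length);
        return logits.OrderByDescending(x => x).Take(keep).ToArray();
    }
}
```

For logits {1, 5, 3, 4}, k = 2 keeps {5, 4}. With k = 1 and minKeep = 3 the result is {5, 4, 3}.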
llama_sample_top_p(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)
Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
public static void llama_sample_top_p(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float p, ulong min_keep)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
p
Single
min_keep
UInt64
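Nucleus (top-p) filtering keeps the smallest prefix of the sorted candidates whose cumulative probability reaches p. The sketch assumes the probabilities are already sorted in descending order and that min_keep is a lower bound on the number of survivors. TopPSketch is a hypothetical name.

```csharp
using System;

// Hypothetical helper for illustration; not part of LLama.Native.
public static class TopPSketch
{
    // Returns how many leading candidates to keep from a descending-sorted distribution.
    public static int KeepCount(ReadOnlySpan<float> sortedProbs, float p, int minKeep)
    {
        float cumulative = 0f;
        for (int i = 0; i < sortedProbs.Length; i++)
        {
            cumulative += sortedProbs[i];
            // Stop once the nucleus holds enough mass and enough candidates.
            if (cumulative >= p && i + 1 >= minKeep) return i + 1;
        }
        return sortedProbs.Length;
    }
}
```

For probabilities {0.5, 0.25, 0.125, 0.125}, p = 0.7 keeps the first 2 candidates (cumulative mass 0.75). With p = 0.5 and minKeep = 3 it keeps 3.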
llama_sample_min_p(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)
Minimum P sampling as described in https://github.com/ggerganov/llama.cpp/pull/3841
public static void llama_sample_min_p(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float p, ulong min_keep)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
p
Single
min_keep
UInt64
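Min-p filtering can be sketched as a threshold relative to the most likely candidate. The example assumes that a candidate survives when its probability is at least p times the top probability, that the input is sorted in descending order, and that min_keep is a lower bound on survivors. MinPSketch is a hypothetical name.

```csharp
using System;

// Hypothetical helper for illustration; not part of LLama.Native.
public static class MinPSketch
{
    // Returns how many leading candidates pass the relative threshold p * max probability.
    public static int KeepCount(ReadOnlySpan<float> sortedProbs, float p, int minKeep)
    {
        if (sortedProbs.IsEmpty) return 0;

        float threshold = p * sortedProbs[0]; // sortedProbs[0] is the largest probability
        int kept = 0;
        for (int i = 0; i < sortedProbs.Length; i++)
        {
            if (sortedProbs[i] < threshold && i >= minKeep) break;
            kept++;
        }
        return kept;
    }
}
```

For probabilities {0.5, 0.25, 0.125, 0.125} and p = 0.5, the threshold is 0.25, so 2 candidates are kept.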
llama_sample_tail_free(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)
Tail Free Sampling described in https://www.trentonbricken.com/Tail-Free-Sampling/.
public static void llama_sample_tail_free(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float z, ulong min_keep)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
z
Single
min_keep
UInt64
llama_sample_typical(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)
Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
public static void llama_sample_typical(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float p, ulong min_keep)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
p
Single
min_keep
UInt64
llama_sample_typical(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, Single, Single)
Dynamic temperature implementation described in the paper https://arxiv.org/abs/2309.02772.
public static void llama_sample_typical(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float min_temp, float max_temp, float exponent_val)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
min_temp
Single
max_temp
Single
exponent_val
Single
llama_sample_temp(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single)
Modify logits by temperature
public static void llama_sample_temp(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float temp)
Parameters
candidates
LLamaTokenDataArrayNative&
temp
Single
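Temperature scaling is commonly implemented by dividing every logit by temp, which flattens the distribution for temp > 1 and sharpens it for temp < 1. The sketch assumes that form and rejects non-positive temperatures, a restriction of this example only. TemperatureSketch is a hypothetical name.

```csharp
using System;

// Hypothetical helper for illustration; not part of LLama.Native.
public static class TemperatureSketch
{
    // Divides each logit by temp in place.
    public static void Apply(Span<float> logits, float temp)
    {
        if (temp <= 0f)
            throw new ArgumentOutOfRangeException(nameof(temp), "This sketch expects temp > 0.");

        for (int i = 0; i < logits.Length; i++)
        {
            logits[i] /= temp;
        }
    }
}
```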
llama_get_embeddings(SafeLLamaContextHandle)
Get the embeddings for the input
public static Span<float> llama_get_embeddings(SafeLLamaContextHandle ctx)
Parameters
Returns
llama_chat_apply_template(SafeLlamaModelHandle, Char*, LLamaChatMessage*, IntPtr, Boolean, Char*, Int32)
Apply a chat template. Inspired by the Hugging Face apply_chat_template() function in Python. Both "model" and "custom_template" are optional, but at least one is required; "custom_template" takes precedence over "model". NOTE: This function does not use a Jinja parser. It supports only a pre-defined list of templates. See https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
public static int llama_chat_apply_template(SafeLlamaModelHandle model, Char* tmpl, LLamaChatMessage* chat, IntPtr n_msg, bool add_ass, Char* buf, int length)
Parameters
model
SafeLlamaModelHandle
tmpl
Char*
A Jinja template to use for this chat. If this is nullptr, the model’s default chat template will be used instead.
chat
LLamaChatMessage*
Pointer to a list of multiple llama_chat_message
n_msg
IntPtr
Number of llama_chat_message in this chat
add_ass
Boolean
Whether to end the prompt with the token(s) that indicate the start of an assistant message.
buf
Char*
A buffer to hold the output formatted prompt. The recommended alloc size is 2 * (total number of characters of all messages)
length
Int32
The size of the allocated buffer
Returns
Int32
The total number of bytes of the formatted prompt. If it is larger than the size of the buffer, you may need to reallocate the buffer and then re-apply the template.
llama_token_bos(SafeLlamaModelHandle)
Get the "Beginning of sentence" token
public static LLamaToken llama_token_bos(SafeLlamaModelHandle model)
Parameters
model
SafeLlamaModelHandle
Returns
llama_token_eos(SafeLlamaModelHandle)
Get the "End of sentence" token
public static LLamaToken llama_token_eos(SafeLlamaModelHandle model)
Parameters
model
SafeLlamaModelHandle
Returns
llama_token_nl(SafeLlamaModelHandle)
Get the "new line" token
public static LLamaToken llama_token_nl(SafeLlamaModelHandle model)
Parameters
model
SafeLlamaModelHandle
Returns
llama_add_bos_token(SafeLlamaModelHandle)
Returns -1 if unknown, 1 for true or 0 for false.
public static int llama_add_bos_token(SafeLlamaModelHandle model)
Parameters
model
SafeLlamaModelHandle
Returns
llama_add_eos_token(SafeLlamaModelHandle)
Returns -1 if unknown, 1 for true or 0 for false.
public static int llama_add_eos_token(SafeLlamaModelHandle model)
Parameters
model
SafeLlamaModelHandle
Returns
llama_token_prefix(SafeLlamaModelHandle)
codellama infill tokens, Beginning of infill prefix
public static int llama_token_prefix(SafeLlamaModelHandle model)
Parameters
model
SafeLlamaModelHandle
Returns
llama_token_middle(SafeLlamaModelHandle)
codellama infill tokens, Beginning of infill middle
public static int llama_token_middle(SafeLlamaModelHandle model)
Parameters
model
SafeLlamaModelHandle
Returns
llama_token_suffix(SafeLlamaModelHandle)
codellama infill tokens, Beginning of infill suffix
public static int llama_token_suffix(SafeLlamaModelHandle model)
Parameters
model
SafeLlamaModelHandle
Returns
llama_token_eot(SafeLlamaModelHandle)
codellama infill tokens, End of infill middle
public static int llama_token_eot(SafeLlamaModelHandle model)
Parameters
model
SafeLlamaModelHandle
Returns
llama_print_timings(SafeLLamaContextHandle)
Print out timing information for this context
public static void llama_print_timings(SafeLLamaContextHandle ctx)
Parameters
llama_reset_timings(SafeLLamaContextHandle)
Reset all collected timing information for this context
public static void llama_reset_timings(SafeLLamaContextHandle ctx)
Parameters
llama_print_system_info()
Print system information
public static IntPtr llama_print_system_info()
Returns
llama_token_to_piece(SafeLlamaModelHandle, LLamaToken, Span<Byte>)
Convert a single token into text
public static int llama_token_to_piece(SafeLlamaModelHandle model, LLamaToken llamaToken, Span<byte> buffer)
Parameters
model
SafeLlamaModelHandle
llamaToken
LLamaToken
buffer
Span<Byte>
buffer to write string into
Returns
Int32
The number of bytes written or, if the buffer is too small, a negative number indicating the required length
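The negative return value lends itself to a call, grow, and retry pattern. The sketch below uses a hypothetical delegate, WritePiece, as a stand-in for a call shaped like llama_token_to_piece, so it runs without a model. It assumes the magnitude of a negative result is the required buffer size in bytes.

```csharp
using System;

// Hypothetical helper for illustration; not part of LLama.Native.
public static class PieceSketch
{
    // Stand-in for the native call: fills the buffer and returns the length written,
    // or a negative value whose magnitude is the required length.
    public delegate int WritePiece(Span<byte> buffer);

    public static byte[] ReadPiece(WritePiece write, int initialSize = 8)
    {
        byte[] buffer = new byte[initialSize];
        int n = write(buffer);
        if (n < 0)
        {
            // Too small: allocate exactly what was asked for and try once more.
            buffer = new byte[-n];
            n = write(buffer);
        }
        if (n < 0) throw new InvalidOperationException("Buffer was still too small after resizing.");
        return buffer.AsSpan(0, n).ToArray();
    }
}
```

The bytes are returned undecoded because a single token may not form a complete UTF-8 sequence on its own.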
llama_tokenize(SafeLlamaModelHandle, Byte*, Int32, LLamaToken*, Int32, Boolean, Boolean)
Convert text into tokens
public static int llama_tokenize(SafeLlamaModelHandle model, Byte* text, int text_len, LLamaToken* tokens, int n_max_tokens, bool add_bos, bool special)
Parameters
model
SafeLlamaModelHandle
text
Byte*
text_len
Int32
tokens
LLamaToken*
n_max_tokens
Int32
add_bos
Boolean
special
Boolean
Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext. Does not insert a leading space.
Returns
Int32
Returns the number of tokens on success, no more than n_max_tokens.
Returns a negative number on failure; its magnitude is the number of tokens that would have been returned.
llama_log_set(LLamaLogCallback)
Register a callback to receive llama log messages
public static void llama_log_set(LLamaLogCallback logCallback)
Parameters
logCallback
LLamaLogCallback
llama_kv_cache_clear(SafeLLamaContextHandle)
Clear the KV cache
public static void llama_kv_cache_clear(SafeLLamaContextHandle ctx)
Parameters
llama_kv_cache_seq_rm(SafeLLamaContextHandle, LLamaSeqId, LLamaPos, LLamaPos)
Removes all tokens that belong to the specified sequence and have positions in [p0, p1)
public static void llama_kv_cache_seq_rm(SafeLLamaContextHandle ctx, LLamaSeqId seq, LLamaPos p0, LLamaPos p1)
Parameters
seq
LLamaSeqId
p0
LLamaPos
p1
LLamaPos
llama_kv_cache_seq_cp(SafeLLamaContextHandle, LLamaSeqId, LLamaSeqId, LLamaPos, LLamaPos)
Copy all tokens that belong to the specified sequence to another sequence. Note that this does not allocate extra KV cache memory; it simply assigns the tokens to the new sequence.
public static void llama_kv_cache_seq_cp(SafeLLamaContextHandle ctx, LLamaSeqId src, LLamaSeqId dest, LLamaPos p0, LLamaPos p1)
Parameters
src
LLamaSeqId
dest
LLamaSeqId
p0
LLamaPos
p1
LLamaPos
llama_kv_cache_seq_keep(SafeLLamaContextHandle, LLamaSeqId)
Removes all tokens that do not belong to the specified sequence
public static void llama_kv_cache_seq_keep(SafeLLamaContextHandle ctx, LLamaSeqId seq)
Parameters
seq
LLamaSeqId
llama_kv_cache_seq_add(SafeLLamaContextHandle, LLamaSeqId, LLamaPos, LLamaPos, Int32)
Adds relative position "delta" to all tokens that belong to the specified sequence and have positions in [p0, p1)
If the KV cache is RoPEd, the KV data is updated accordingly:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
public static void llama_kv_cache_seq_add(SafeLLamaContextHandle ctx, LLamaSeqId seq, LLamaPos p0, LLamaPos p1, int delta)
Parameters
seq
LLamaSeqId
p0
LLamaPos
p1
LLamaPos
delta
Int32
llama_kv_cache_seq_div(SafeLLamaContextHandle, LLamaSeqId, LLamaPos, LLamaPos, Int32)
Integer division of the positions by factor of d > 1
If the KV cache is RoPEd, the KV data is updated accordingly:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
p0 < 0 : [0, p1]
p1 < 0 : [p0, inf)
public static void llama_kv_cache_seq_div(SafeLLamaContextHandle ctx, LLamaSeqId seq, LLamaPos p0, LLamaPos p1, int d)
Parameters
seq
LLamaSeqId
p0
LLamaPos
p1
LLamaPos
d
Int32
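The position arithmetic can be sketched on a plain array of positions. The example assumes the affected range is half-open, [p0, p1), that a negative p0 means 0, and that a negative p1 means "no upper bound". It ignores the RoPE update of the cached data. SeqDivSketch is a hypothetical name.

```csharp
using System;

// Hypothetical helper for illustration; not part of LLama.Native.
public static class SeqDivSketch
{
    // Integer-divides every position inside [p0, p1) by d, in place.
    public static void DividePositions(int[] positions, int p0, int p1, int d)
    {
        if (d <= 1) throw new ArgumentOutOfRangeException(nameof(d), "d must be greater than 1.");

        int lo = p0 < 0 ? 0 : p0;            // negative p0: start from 0
        int hi = p1 < 0 ? int.MaxValue : p1; // negative p1: no upper bound

        for (int i = 0; i < positions.Length; i++)
        {
            if (positions[i] >= lo && positions[i] < hi)
            {
                positions[i] /= d;
            }
        }
    }
}
```

For positions 0 through 7 with p0 = 2, p1 = 6, and d = 2, the result is {0, 1, 1, 1, 2, 2, 6, 7}.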
llama_kv_cache_seq_pos_max(SafeLLamaContextHandle, LLamaSeqId)
Returns the largest position present in the KV cache for the specified sequence
public static LLamaPos llama_kv_cache_seq_pos_max(SafeLLamaContextHandle ctx, LLamaSeqId seq)
Parameters
seq
LLamaSeqId
Returns
llama_kv_cache_defrag(SafeLLamaContextHandle)
Defragment the KV cache. This will be applied:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
public static LLamaPos llama_kv_cache_defrag(SafeLLamaContextHandle ctx)
Parameters
Returns
llama_kv_cache_update(SafeLLamaContextHandle)
Apply the KV cache updates (such as K-shifts, defragmentation, etc.)
public static void llama_kv_cache_update(SafeLLamaContextHandle ctx)
Parameters
llama_batch_init(Int32, Int32, Int32)
Allocates a batch of tokens on the heap.
Each token can be assigned up to n_seq_max sequence ids.
The batch has to be freed with llama_batch_free().
If embd != 0, llama_batch.embd will be allocated with size of n_tokens * embd * sizeof(float).
Otherwise, llama_batch.token will be allocated to store n_tokens llama_token.
The rest of the llama_batch members are allocated with size n_tokens.
All members are left uninitialized.
public static LLamaNativeBatch llama_batch_init(int n_tokens, int embd, int n_seq_max)
Parameters
n_tokens
Int32
embd
Int32
n_seq_max
Int32
Each token can be assigned up to n_seq_max sequence ids
Returns
llama_batch_free(LLamaNativeBatch)
Frees a batch of tokens allocated with llama_batch_init()
public static void llama_batch_free(LLamaNativeBatch batch)
Parameters
batch
LLamaNativeBatch
llama_decode(SafeLLamaContextHandle, LLamaNativeBatch)
public static int llama_decode(SafeLLamaContextHandle ctx, LLamaNativeBatch batch)
Parameters
batch
LLamaNativeBatch
Returns
Int32
Positive return values do not indicate a fatal error, but rather a warning:
- 0: success
- 1: could not find a KV slot for the batch (try reducing the size of the batch or increase the context)
- < 0: error
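A caller might turn these return codes into something more readable before deciding how to react. The mapping below follows the list above; positive values other than 1 are treated as generic warnings. DecodeResultSketch is a hypothetical name.

```csharp
using System;

// Hypothetical helper for illustration; not part of LLama.Native.
public static class DecodeResultSketch
{
    // Maps a llama_decode-style return code to a short description.
    public static string Describe(int code)
    {
        if (code == 0) return "success";
        if (code == 1) return "no KV slot: reduce the batch size or increase the context";
        if (code < 0) return "error";
        return "warning";
    }
}
```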
llama_kv_cache_view_init(SafeLLamaContextHandle, Int32)
Create an empty KV cache view. (use only for debugging purposes)
public static LLamaKvCacheView llama_kv_cache_view_init(SafeLLamaContextHandle ctx, int n_max_seq)
Parameters
n_max_seq
Int32
Returns
llama_kv_cache_view_free(LLamaKvCacheView&)
Free a KV cache view. (use only for debugging purposes)
public static void llama_kv_cache_view_free(LLamaKvCacheView& view)
Parameters
view
LLamaKvCacheView&
llama_kv_cache_view_update(SafeLLamaContextHandle, LLamaKvCacheView&)
Update the KV cache view structure with the current state of the KV cache. (use only for debugging purposes)
public static void llama_kv_cache_view_update(SafeLLamaContextHandle ctx, LLamaKvCacheView& view)
Parameters
view
LLamaKvCacheView&
llama_get_kv_cache_token_count(SafeLLamaContextHandle)
Returns the number of tokens in the KV cache (slow, use only for debug). If a KV cell has multiple sequences assigned to it, it will be counted multiple times.
public static int llama_get_kv_cache_token_count(SafeLLamaContextHandle ctx)
Parameters
Returns
llama_get_kv_cache_used_cells(SafeLLamaContextHandle)
Returns the number of used KV cells (i.e. have at least one sequence assigned to them)
public static int llama_get_kv_cache_used_cells(SafeLLamaContextHandle ctx)
Parameters
Returns
llama_beam_search(SafeLLamaContextHandle, LLamaBeamSearchCallback, IntPtr, UInt64, Int32, Int32, Int32)
Deterministically returns entire sentence constructed by a beam search.
public static void llama_beam_search(SafeLLamaContextHandle ctx, LLamaBeamSearchCallback callback, IntPtr callback_data, ulong n_beams, int n_past, int n_predict, int n_threads)
Parameters
ctx
SafeLLamaContextHandle
Pointer to the llama_context.
callback
LLamaBeamSearchCallback
Invoked for each iteration of the beam_search loop, passing in beams_state.
callback_data
IntPtr
A pointer that is simply passed back to callback.
n_beams
UInt64
Number of beams to use.
n_past
Int32
Number of tokens already evaluated.
n_predict
Int32
Maximum number of tokens to predict. EOS may occur earlier.
n_threads
Int32
Number of threads.
llama_empty_call()
A method that does nothing. This is a native method, calling it will force the llama native dependencies to be loaded.
public static void llama_empty_call()
llama_max_devices()
Get the maximum number of devices supported by llama.cpp
public static long llama_max_devices()
Returns
llama_model_default_params()
Create a LLamaModelParams with default values
public static LLamaModelParams llama_model_default_params()
Returns
llama_context_default_params()
Create a LLamaContextParams with default values
public static LLamaContextParams llama_context_default_params()
Returns
llama_model_quantize_default_params()
Create a LLamaModelQuantizeParams with default values
public static LLamaModelQuantizeParams llama_model_quantize_default_params()
Returns
llama_supports_mmap()
Check if memory mapping is supported
public static bool llama_supports_mmap()
Returns
llama_supports_mlock()
Check if memory locking is supported
public static bool llama_supports_mlock()
Returns
llama_supports_gpu_offload()
Check if GPU offload is supported
public static bool llama_supports_gpu_offload()
Returns
llama_set_rng_seed(SafeLLamaContextHandle, UInt32)
Sets the current rng seed.
public static void llama_set_rng_seed(SafeLLamaContextHandle ctx, uint seed)
Parameters
seed
UInt32
llama_get_state_size(SafeLLamaContextHandle)
Returns the maximum size in bytes of the state (rng, logits, embedding and kv_cache) - will often be smaller after compacting tokens
public static ulong llama_get_state_size(SafeLLamaContextHandle ctx)
Parameters
Returns
llama_copy_state_data(SafeLLamaContextHandle, Byte*)
Copies the state to the specified destination address. Destination needs to have allocated enough memory.
public static ulong llama_copy_state_data(SafeLLamaContextHandle ctx, Byte* dest)
Parameters
dest
Byte*
Returns
UInt64
the number of bytes copied
llama_set_state_data(SafeLLamaContextHandle, Byte*)
Set the state reading from the specified address
public static ulong llama_set_state_data(SafeLLamaContextHandle ctx, Byte* src)
Parameters
src
Byte*
Returns
UInt64
the number of bytes read
llama_load_session_file(SafeLLamaContextHandle, String, LLamaToken[], UInt64, UInt64&)
Load session file
public static bool llama_load_session_file(SafeLLamaContextHandle ctx, string path_session, LLamaToken[] tokens_out, ulong n_token_capacity, UInt64& n_token_count_out)
Parameters
path_session
String
tokens_out
LLamaToken[]
n_token_capacity
UInt64
n_token_count_out
UInt64&
Returns
llama_save_session_file(SafeLLamaContextHandle, String, LLamaToken[], UInt64)
Save session file
public static bool llama_save_session_file(SafeLLamaContextHandle ctx, string path_session, LLamaToken[] tokens, ulong n_token_count)
Parameters
path_session
String
tokens
LLamaToken[]
n_token_count
UInt64
Returns
llama_token_get_text(SafeLlamaModelHandle, LLamaToken)
public static Byte* llama_token_get_text(SafeLlamaModelHandle model, LLamaToken token)
Parameters
model
SafeLlamaModelHandle
token
LLamaToken
Returns
llama_token_get_score(SafeLlamaModelHandle, LLamaToken)
public static float llama_token_get_score(SafeLlamaModelHandle model, LLamaToken token)
Parameters
model
SafeLlamaModelHandle
token
LLamaToken
Returns
llama_token_get_type(SafeLlamaModelHandle, LLamaToken)
public static LLamaTokenType llama_token_get_type(SafeLlamaModelHandle model, LLamaToken token)
Parameters
model
SafeLlamaModelHandle
token
LLamaToken
Returns
llama_n_ctx(SafeLLamaContextHandle)
Get the size of the context window for the model for this context
public static uint llama_n_ctx(SafeLLamaContextHandle ctx)
Parameters
Returns
llama_n_batch(SafeLLamaContextHandle)
Get the batch size for this context
public static uint llama_n_batch(SafeLLamaContextHandle ctx)
Parameters
Returns
llama_get_logits(SafeLLamaContextHandle)
Token logits obtained from the last call to llama_decode
The logits for the last token are stored in the last row
Can be mutated in order to change the probabilities of the next token.
Rows: n_tokens
Cols: n_vocab
public static Single* llama_get_logits(SafeLLamaContextHandle ctx)
Parameters
Returns
llama_get_logits_ith(SafeLLamaContextHandle, Int32)
Logits for the ith token. Equivalent to: llama_get_logits(ctx) + i*n_vocab
public static Single* llama_get_logits_ith(SafeLLamaContextHandle ctx, int i)
Parameters
i
Int32
Returns
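The pointer arithmetic llama_get_logits(ctx) + i*n_vocab corresponds to selecting row i of a row-major n_tokens by n_vocab matrix. The sketch expresses the same indexing safely on a managed span instead of a raw Single*. LogitRowSketch is a hypothetical name.

```csharp
using System;

// Hypothetical helper for illustration; not part of LLama.Native.
public static class LogitRowSketch
{
    // Returns the nVocab logits belonging to token i from a flattened row-major buffer.
    public static ReadOnlySpan<float> Row(ReadOnlySpan<float> logits, int i, int nVocab)
        => logits.Slice(i * nVocab, nVocab);
}
```

With a flattened buffer {1, 2, 3, 4, 5, 6} and nVocab = 3, row 1 is {4, 5, 6}.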
llama_get_embeddings_ith(SafeLLamaContextHandle, Int32)
Get the embeddings for the ith sequence. Equivalent to: llama_get_embeddings(ctx) + i*n_embd
public static Single* llama_get_embeddings_ith(SafeLLamaContextHandle ctx, int i)
Parameters
i
Int32