NativeApi
Namespace: LLama.Native
Direct translation of the llama.cpp API
Inheritance Object → NativeApi
Methods
llama_sample_token_mirostat(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, Single, Int32, Single&)
Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
Parameters
candidates LLamaTokenDataArrayNative&
A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
tau Single
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
eta Single
The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.
m Int32
The number of tokens considered in the estimation of s_hat. This is an arbitrary value that is used to calculate s_hat, which in turn helps to calculate the value of k. In the paper, they use m = 100, but you can experiment with different values to see how it affects the performance of the algorithm.
mu Single&
Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.
Returns
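For illustration only (not part of the generated reference), a minimal C# sketch of calling this sampler, assuming `ctx` and `candidates` were already prepared by a decode step and earlier logit processing; the helper name and hyper-parameter values are hypothetical and simply mirror the paper's defaults:

```csharp
// Hedged sketch: assumes `using LLama.Native;` inside some class, and that
// llama_sample_token_mirostat returns the sampled LLamaToken.
static void SampleWithMirostatV1(SafeLLamaContextHandle ctx, ref LLamaTokenDataArrayNative candidates, ref float mu)
{
    const float tau = 5.0f;  // target surprise (cross-entropy)
    const float eta = 0.1f;  // learning rate for the mu update
    const int m = 100;       // tokens used to estimate s_hat, as in the paper

    // mu should be initialised to 2 * tau before the first call and carried between calls.
    var token = NativeApi.llama_sample_token_mirostat(ctx, ref candidates, tau, eta, m, ref mu);
    // ... feed `token` into the next llama_decode call
}
```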
llama_sample_token_mirostat_v2(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, Single, Single&)
Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
Parameters
candidates LLamaTokenDataArrayNative&
A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
tau Single
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
eta Single
The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.
mu Single&
Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.
Returns
llama_sample_token_greedy(SafeLLamaContextHandle, LLamaTokenDataArrayNative&)
Selects the token with the highest probability.
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
Returns
llama_sample_token(SafeLLamaContextHandle, LLamaTokenDataArrayNative&)
Randomly selects a token from the candidates based on their probabilities.
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
Returns
<llama_get_embeddings>g__llama_get_embeddings_native|30_0(SafeLLamaContextHandle)
Parameters
Returns
<llama_token_to_piece>g__llama_token_to_piece_native|44_0(SafeLlamaModelHandle, LLamaToken, Byte*, Int32)
Parameters
model SafeLlamaModelHandle
llamaToken LLamaToken
buffer Byte*
length Int32
Returns
<TryLoadLibraries>g__TryLoad|84_0(String)
Parameters
path String
Returns
<TryLoadLibraries>g__TryFindPath|84_1(String, <>c__DisplayClass84_0&)
Parameters
filename String
Returns
llama_set_n_threads(SafeLLamaContextHandle, UInt32, UInt32)
Set the number of threads used for decoding
Parameters
n_threads UInt32
The number of threads used for generation (single token)
n_threads_batch UInt32
The number of threads used for prompt and batch processing (multiple tokens)
llama_vocab_type(SafeLlamaModelHandle)
Parameters
model SafeLlamaModelHandle
Returns
llama_rope_type(SafeLlamaModelHandle)
Parameters
model SafeLlamaModelHandle
Returns
llama_grammar_init(LLamaGrammarElement**, UInt64, UInt64)
Create a new grammar from the given set of grammar rules
Parameters
rules LLamaGrammarElement**
n_rules UInt64
start_rule_index UInt64
Returns
llama_grammar_free(IntPtr)
Free all memory from the given SafeLLamaGrammarHandle
Parameters
grammar IntPtr
llama_grammar_copy(SafeLLamaGrammarHandle)
Create a copy of an existing grammar instance
Parameters
grammar SafeLLamaGrammarHandle
Returns
llama_sample_grammar(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, SafeLLamaGrammarHandle)
Apply constraints from grammar
Parameters
candidates LLamaTokenDataArrayNative&
grammar SafeLLamaGrammarHandle
llama_grammar_accept_token(SafeLLamaContextHandle, SafeLLamaGrammarHandle, LLamaToken)
Accepts the sampled token into the grammar
Parameters
grammar SafeLLamaGrammarHandle
token LLamaToken
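A hedged usage sketch (not from the upstream docs) showing how the grammar sampling calls fit together, assuming `ctx`, `candidates` and a previously created SafeLLamaGrammarHandle already exist; the helper name is hypothetical:

```csharp
// Hedged sketch: assumes `using LLama.Native;` and that llama_sample_token returns an LLamaToken.
static void SampleWithGrammar(SafeLLamaContextHandle ctx, ref LLamaTokenDataArrayNative candidates, SafeLLamaGrammarHandle grammar)
{
    // Mask out candidates the grammar cannot accept at this position.
    NativeApi.llama_sample_grammar(ctx, ref candidates, grammar);

    // Pick a token from the remaining candidates, then advance the grammar state with it.
    var token = NativeApi.llama_sample_token(ctx, ref candidates);
    NativeApi.llama_grammar_accept_token(ctx, grammar, token);
}
```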
llava_validate_embed_size(SafeLLamaContextHandle, SafeLlavaModelHandle)
Sanity check for clip <-> llava embed size match
Parameters
ctxLlama SafeLLamaContextHandle
LLama Context
ctxClip SafeLlavaModelHandle
Llava Model
Returns
Boolean
True if validation succeeds (the embed sizes match)
llava_image_embed_make_with_bytes(SafeLlavaModelHandle, Int32, Byte[], Int32)
Build an image embed from image file bytes
Parameters
ctx_clip SafeLlavaModelHandle
SafeHandle to the Clip Model
n_threads Int32
Number of threads
image_bytes Byte[]
Binary image in jpeg format
image_bytes_length Int32
Length of the image data in bytes
Returns
SafeLlavaImageEmbedHandle
SafeHandle to the Embeddings
llava_image_embed_make_with_filename(SafeLlavaModelHandle, Int32, String)
Build an image embed from a path to an image file
Parameters
ctx_clip SafeLlavaModelHandle
SafeHandle to the Clip Model
n_threads Int32
Number of threads
image_path String
Path to the image file (jpeg) to generate embeddings from
Returns
SafeLlavaImageEmbedHandle
SafeHandle to the embeddings
llava_image_embed_free(IntPtr)
Free an embedding made with llava_image_embed_make_*
Parameters
embed IntPtr
Embeddings to release
llava_eval_image_embed(SafeLLamaContextHandle, SafeLlavaImageEmbedHandle, Int32, Int32&)
Write the image represented by embed into the llama context with batch size n_batch, starting at context pos n_past. On completion, n_past points to the next position in the context after the image embed.
Parameters
ctx_llama SafeLLamaContextHandle
Llama Context
embed SafeLlavaImageEmbedHandle
Embedding handle
n_batch Int32
n_past Int32&
Returns
Boolean
True on success
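Taken together, the llava entry points above can be used roughly as in the following hedged sketch; the helper name, thread count and error handling are illustrative, not prescribed by the API:

```csharp
// Hedged sketch: assumes `using System;` and `using LLama.Native;`.
static void EvalImage(SafeLLamaContextHandle ctxLlama, SafeLlavaModelHandle ctxClip, string imagePath, int nBatch, ref int nPast)
{
    if (!NativeApi.llava_validate_embed_size(ctxLlama, ctxClip))
        throw new InvalidOperationException("clip and llava embed sizes do not match");

    // Build the image embedding, then write it into the context starting at nPast.
    var embed = NativeApi.llava_image_embed_make_with_filename(ctxClip, Environment.ProcessorCount, imagePath);
    if (!NativeApi.llava_eval_image_embed(ctxLlama, embed, nBatch, ref nPast))
        throw new InvalidOperationException("failed to evaluate the image embedding");

    // nPast now points at the first position after the image embedding.
}
```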
llama_model_quantize(String, String, LLamaModelQuantizeParams*)
Returns 0 on success
Parameters
fname_inp String
fname_out String
param LLamaModelQuantizeParams*
Returns
UInt32
Returns 0 on success
llama_sample_repetition_penalties(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, LLamaToken*, UInt64, Single, Single, Single)
Repetition penalty described in CTRL academic paper https://arxiv.org/abs/1909.05858, with negative logit fix. Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
last_tokens LLamaToken*
last_tokens_size UInt64
penalty_repeat Single
Repetition penalty described in CTRL academic paper https://arxiv.org/abs/1909.05858, with negative logit fix.
penalty_freq Single
Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
penalty_present Single
Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
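A hedged C# sketch of applying these penalties before sampling, assuming unsafe code is enabled and a candidates array already exists; the penalty values are illustrative:

```csharp
// Hedged sketch: assumes `using LLama.Native;` and <AllowUnsafeBlocks>true</AllowUnsafeBlocks>.
static unsafe void ApplyPenalties(SafeLLamaContextHandle ctx, ref LLamaTokenDataArrayNative candidates, LLamaToken[] lastTokens)
{
    fixed (LLamaToken* lastPtr = lastTokens)
    {
        // Repeat penalty 1.1; frequency and presence penalties disabled (0).
        NativeApi.llama_sample_repetition_penalties(ctx, ref candidates, lastPtr, (ulong)lastTokens.Length, 1.1f, 0.0f, 0.0f);
    }
}
```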
llama_sample_apply_guidance(SafeLLamaContextHandle, Span<Single>, ReadOnlySpan<Single>, Single)
Apply classifier-free guidance to the logits as described in academic paper "Stay on topic with Classifier-Free Guidance" https://arxiv.org/abs/2306.17806
Parameters
logits Span<Single>
Logits extracted from the original generation context.
logits_guidance ReadOnlySpan<Single>
Logits extracted from a separate context from the same model.
Other than a negative prompt at the beginning, it should have all generated and user input tokens copied from the main context.
scale Single
Guidance strength. 1.0f means no guidance. Higher values mean stronger guidance.
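An illustrative sketch of this span overload, assuming the logits for both contexts have already been copied into managed arrays; the scale value is just an example:

```csharp
// Hedged sketch: assumes `using LLama.Native;`. Arrays convert implicitly to Span/ReadOnlySpan.
static void ApplyGuidance(SafeLLamaContextHandle ctx, float[] logits, float[] guidanceLogits)
{
    // scale = 1.0f leaves the logits unchanged; larger values push harder towards the guided context.
    NativeApi.llama_sample_apply_guidance(ctx, logits, guidanceLogits, 1.5f);
}
```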
llama_sample_apply_guidance(SafeLLamaContextHandle, Single*, Single*, Single)
Apply classifier-free guidance to the logits as described in academic paper "Stay on topic with Classifier-Free Guidance" https://arxiv.org/abs/2306.17806
Parameters
logits Single*
Logits extracted from the original generation context.
logits_guidance Single*
Logits extracted from a separate context from the same model.
Other than a negative prompt at the beginning, it should have all generated and user input tokens copied from the main context.
scale Single
Guidance strength. 1.0f means no guidance. Higher values mean stronger guidance.
llama_sample_softmax(SafeLLamaContextHandle, LLamaTokenDataArrayNative&)
Sorts candidate tokens by their logits in descending order and calculates probabilities based on the logits.
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
llama_sample_top_k(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Int32, UInt64)
Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
k Int32
min_keep UInt64
llama_sample_top_p(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)
Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
p Single
min_keep UInt64
llama_sample_min_p(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)
Minimum P sampling as described in https://github.com/ggerganov/llama.cpp/pull/3841
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
p Single
min_keep UInt64
llama_sample_tail_free(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)
Tail Free Sampling described in https://www.trentonbricken.com/Tail-Free-Sampling/.
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
z Single
min_keep UInt64
llama_sample_typical(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)
Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
p Single
min_keep UInt64
llama_sample_entropy(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, Single, Single)
Dynamic temperature implementation described in the paper https://arxiv.org/abs/2309.02772.
Parameters
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
min_temp Single
max_temp Single
exponent_val Single
llama_sample_temp(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single)
Modify logits by temperature
Parameters
candidates LLamaTokenDataArrayNative&
temp Single
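The individual samplers above are usually chained; a hedged sketch of one common ordering (the cut-offs and temperature are illustrative, and the helper name is hypothetical):

```csharp
// Hedged sketch: assumes `using LLama.Native;` and that llama_sample_token returns an LLamaToken.
static void SampleNextToken(SafeLLamaContextHandle ctx, ref LLamaTokenDataArrayNative candidates)
{
    NativeApi.llama_sample_top_k(ctx, ref candidates, 40, 1);      // keep the 40 most likely tokens
    NativeApi.llama_sample_top_p(ctx, ref candidates, 0.95f, 1);   // nucleus cut-off
    NativeApi.llama_sample_temp(ctx, ref candidates, 0.8f);        // temperature
    var token = NativeApi.llama_sample_token(ctx, ref candidates); // draw from what is left
    // ... feed `token` into the next llama_decode call
}
```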
llama_get_embeddings(SafeLLamaContextHandle)
Get the embeddings for the input
Parameters
Returns
llama_chat_apply_template(SafeLlamaModelHandle, Char*, LLamaChatMessage*, IntPtr, Boolean, Char*, Int32)
Apply chat template. Inspired by hf apply_chat_template() in Python. Both "model" and "custom_template" are optional, but at least one is required; "custom_template" has higher precedence than "model". NOTE: This function does not use a jinja parser. It only supports a pre-defined list of templates. See more: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
Parameters
model SafeLlamaModelHandle
tmpl Char*
A Jinja template to use for this chat. If this is nullptr, the model’s default chat template will be used instead.
chat LLamaChatMessage*
Pointer to a list of multiple llama_chat_message
n_msg IntPtr
Number of llama_chat_message in this chat
add_ass Boolean
Whether to end the prompt with the token(s) that indicate the start of an assistant message.
buf Char*
A buffer to hold the output formatted prompt. The recommended alloc size is 2 * (total number of characters of all messages)
length Int32
The size of the allocated buffer
Returns
Int32
The total number of bytes of the formatted prompt. If it is larger than the size of the buffer, you may need to re-alloc it and then re-apply the template.
llama_token_bos(SafeLlamaModelHandle)
Get the "Beginning of sentence" token
Parameters
model SafeLlamaModelHandle
Returns
llama_token_eos(SafeLlamaModelHandle)
Get the "End of sentence" token
Parameters
model SafeLlamaModelHandle
Returns
llama_token_nl(SafeLlamaModelHandle)
Get the "new line" token
Parameters
model SafeLlamaModelHandle
Returns
llama_add_bos_token(SafeLlamaModelHandle)
Returns -1 if unknown, 1 for true or 0 for false.
Parameters
model SafeLlamaModelHandle
Returns
llama_add_eos_token(SafeLlamaModelHandle)
Returns -1 if unknown, 1 for true or 0 for false.
Parameters
model SafeLlamaModelHandle
Returns
llama_token_prefix(SafeLlamaModelHandle)
Codellama infill tokens: beginning of infill prefix
Parameters
model SafeLlamaModelHandle
Returns
llama_token_middle(SafeLlamaModelHandle)
Codellama infill tokens: beginning of infill middle
Parameters
model SafeLlamaModelHandle
Returns
llama_token_suffix(SafeLlamaModelHandle)
Codellama infill tokens: beginning of infill suffix
Parameters
model SafeLlamaModelHandle
Returns
llama_token_eot(SafeLlamaModelHandle)
Codellama infill tokens: end of infill middle
Parameters
model SafeLlamaModelHandle
Returns
llama_print_timings(SafeLLamaContextHandle)
Print out timing information for this context
Parameters
llama_reset_timings(SafeLLamaContextHandle)
Reset all collected timing information for this context
Parameters
llama_print_system_info()
Print system information
Returns
llama_token_to_piece(SafeLlamaModelHandle, LLamaToken, Span<Byte>)
Convert a single token into text
Parameters
model SafeLlamaModelHandle
llamaToken LLamaToken
buffer Span<Byte>
buffer to write string into
Returns
Int32
The length written, or if the buffer is too small a negative that indicates the length required
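A hedged sketch of the "retry with a larger buffer" pattern this return value suggests; the helper name and initial buffer size are illustrative:

```csharp
// Hedged sketch: assumes `using System;`, `using System.Text;` and `using LLama.Native;`.
static string TokenToText(SafeLlamaModelHandle model, LLamaToken token)
{
    Span<byte> buffer = stackalloc byte[32];
    int written = NativeApi.llama_token_to_piece(model, token, buffer);
    if (written < 0)
    {
        // A negative result is the required length; retry with a buffer that large.
        buffer = new byte[-written];
        written = NativeApi.llama_token_to_piece(model, token, buffer);
    }
    return Encoding.UTF8.GetString(buffer.Slice(0, written));
}
```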
llama_tokenize(SafeLlamaModelHandle, Byte*, Int32, LLamaToken*, Int32, Boolean, Boolean)
Convert text into tokens
Parameters
model SafeLlamaModelHandle
text Byte*
text_len Int32
tokens LLamaToken*
n_max_tokens Int32
add_bos Boolean
special Boolean
Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext. Does not insert a leading space.
Returns
Int32
Returns the number of tokens on success, no more than n_max_tokens.
Returns a negative number on failure - the number of tokens that would have been returned
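A hedged sketch of a tokenize call from managed code, assuming unsafe code is enabled; the buffer sizing heuristic and helper name are illustrative:

```csharp
// Hedged sketch: assumes `using System;`, `using System.Text;`, `using LLama.Native;` and unsafe code enabled.
static unsafe LLamaToken[] Tokenize(SafeLlamaModelHandle model, string text, bool addBos)
{
    var bytes = Encoding.UTF8.GetBytes(text);
    // One token per byte (plus an optional BOS) is generally a safe upper bound.
    var tokens = new LLamaToken[bytes.Length + (addBos ? 1 : 0)];

    int count;
    fixed (byte* textPtr = bytes)
    fixed (LLamaToken* tokenPtr = tokens)
    {
        count = NativeApi.llama_tokenize(model, textPtr, bytes.Length, tokenPtr, tokens.Length, addBos, false);
    }

    if (count < 0)
        throw new InvalidOperationException($"token buffer too small, {-count} tokens required");

    Array.Resize(ref tokens, count);
    return tokens;
}
```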
llama_log_set(LLamaLogCallback)
Register a callback to receive llama log messages
Parameters
logCallback LLamaLogCallback
llama_kv_cache_clear(SafeLLamaContextHandle)
Clear the KV cache
Parameters
llama_kv_cache_seq_rm(SafeLLamaContextHandle, LLamaSeqId, LLamaPos, LLamaPos)
Removes all tokens that belong to the specified sequence and have positions in [p0, p1)
Parameters
seq LLamaSeqId
p0 LLamaPos
p1 LLamaPos
llama_kv_cache_seq_cp(SafeLLamaContextHandle, LLamaSeqId, LLamaSeqId, LLamaPos, LLamaPos)
Copy all tokens that belong to the specified sequence to another sequence. Note that this does not allocate extra KV cache memory; it simply assigns the tokens to the new sequence.
Parameters
src LLamaSeqId
dest LLamaSeqId
p0 LLamaPos
p1 LLamaPos
llama_kv_cache_seq_keep(SafeLLamaContextHandle, LLamaSeqId)
Removes all tokens that do not belong to the specified sequence
Parameters
seq LLamaSeqId
llama_kv_cache_seq_add(SafeLLamaContextHandle, LLamaSeqId, LLamaPos, LLamaPos, Int32)
Adds relative position "delta" to all tokens that belong to the specified sequence and have positions in [p0, p1).
If the KV cache is RoPEd, the KV data is updated accordingly:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
Parameters
seq LLamaSeqId
p0 LLamaPos
p1 LLamaPos
delta Int32
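As an illustration of how llama_kv_cache_seq_rm and llama_kv_cache_seq_add can be combined for simple context-window sliding, a hedged sketch; it assumes, as in current LLamaSharp, that LLamaSeqId and LLamaPos convert implicitly from int, and the helper name is hypothetical:

```csharp
// Hedged sketch: assumes `using LLama.Native;` and implicit int conversions for LLamaSeqId/LLamaPos.
static void ShiftContext(SafeLLamaContextHandle ctx, int nKeep, int nPast, int nDiscard)
{
    LLamaSeqId seq = 0;

    // Drop the oldest nDiscard tokens after the first nKeep, then slide the remainder back.
    NativeApi.llama_kv_cache_seq_rm(ctx, seq, nKeep, nKeep + nDiscard);
    NativeApi.llama_kv_cache_seq_add(ctx, seq, nKeep + nDiscard, nPast, -nDiscard);
}
```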
llama_kv_cache_seq_div(SafeLLamaContextHandle, LLamaSeqId, LLamaPos, LLamaPos, Int32)
Integer division of the positions by factor of d > 1
If the KV cache is RoPEd, the KV data is updated accordingly:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
p0 < 0 : [0, p1]
p1 < 0 : [p0, inf)
Parameters
seq LLamaSeqId
p0 LLamaPos
p1 LLamaPos
d Int32
llama_kv_cache_seq_pos_max(SafeLLamaContextHandle, LLamaSeqId)
Returns the largest position present in the KV cache for the specified sequence
Parameters
seq LLamaSeqId
Returns
llama_kv_cache_defrag(SafeLLamaContextHandle)
Defragment the KV cache. This will be applied:
- lazily on next llama_decode()
- explicitly with llama_kv_cache_update()
Parameters
Returns
llama_kv_cache_update(SafeLLamaContextHandle)
Apply the KV cache updates (such as K-shifts, defragmentation, etc.)
Parameters
llama_batch_init(Int32, Int32, Int32)
Allocates a batch of tokens on the heap.
Each token can be assigned up to n_seq_max sequence ids.
The batch has to be freed with llama_batch_free().
If embd != 0, llama_batch.embd will be allocated with size of n_tokens * embd * sizeof(float).
Otherwise, llama_batch.token will be allocated to store n_tokens llama_token.
The rest of the llama_batch members are allocated with size n_tokens.
All members are left uninitialized.
Parameters
n_tokens Int32
embd Int32
n_seq_max Int32
Each token can be assigned up to n_seq_max sequence ids
Returns
llama_batch_free(LLamaNativeBatch)
Frees a batch of tokens allocated with llama_batch_init()
Parameters
batch LLamaNativeBatch
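A hedged sketch of the allocate/use/free lifecycle these two functions imply; the sizes are illustrative:

```csharp
// Hedged sketch: assumes `using LLama.Native;`.
static void BatchLifetime()
{
    // Room for up to 512 tokens, token IDs rather than embeddings (embd == 0), one sequence id per token.
    var batch = NativeApi.llama_batch_init(512, 0, 1);
    try
    {
        // ... fill in the batch members (they start uninitialised) and pass it to llama_decode ...
    }
    finally
    {
        NativeApi.llama_batch_free(batch);
    }
}
```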
llama_decode(SafeLLamaContextHandle, LLamaNativeBatch)
Parameters
batch LLamaNativeBatch
Returns
Int32
A positive return value does not indicate a fatal error, but rather a warning:
- 0: success
- 1: could not find a KV slot for the batch (try reducing the size of the batch or increasing the context)
- < 0: error
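A hedged sketch of handling these return codes; the helper name and error handling are illustrative:

```csharp
// Hedged sketch: assumes `using System;` and `using LLama.Native;`.
static void Decode(SafeLLamaContextHandle ctx, LLamaNativeBatch batch)
{
    int status = NativeApi.llama_decode(ctx, batch);
    if (status == 1)
    {
        // Warning: no KV slot was found; retry with a smaller batch or a larger context.
    }
    else if (status < 0)
    {
        throw new InvalidOperationException($"llama_decode failed with code {status}");
    }
}
```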
llama_kv_cache_view_init(SafeLLamaContextHandle, Int32)
Create an empty KV cache view. (use only for debugging purposes)
Parameters
n_max_seq Int32
Returns
llama_kv_cache_view_free(LLamaKvCacheView&)
Free a KV cache view. (use only for debugging purposes)
Parameters
view LLamaKvCacheView&
llama_kv_cache_view_update(SafeLLamaContextHandle, LLamaKvCacheView&)
Update the KV cache view structure with the current state of the KV cache. (use only for debugging purposes)
Parameters
view LLamaKvCacheView&
llama_get_kv_cache_token_count(SafeLLamaContextHandle)
Returns the number of tokens in the KV cache (slow, use only for debug) If a KV cell has multiple sequences assigned to it, it will be counted multiple times
Parameters
Returns
llama_get_kv_cache_used_cells(SafeLLamaContextHandle)
Returns the number of used KV cells (i.e. have at least one sequence assigned to them)
Parameters
Returns
llama_beam_search(SafeLLamaContextHandle, LLamaBeamSearchCallback, IntPtr, UInt64, Int32, Int32, Int32)
Deterministically returns the entire sentence constructed by a beam search.
Parameters
ctx SafeLLamaContextHandle
Pointer to the llama_context.
callback LLamaBeamSearchCallback
Invoked for each iteration of the beam_search loop, passing in beams_state.
callback_data IntPtr
A pointer that is simply passed back to callback.
n_beams UInt64
Number of beams to use.
n_past Int32
Number of tokens already evaluated.
n_predict Int32
Maximum number of tokens to predict. EOS may occur earlier.
n_threads Int32
Number of threads.
llama_empty_call()
A method that does nothing. This is a native method; calling it forces the llama native dependencies to be loaded.
llama_max_devices()
Get the maximum number of devices supported by llama.cpp
Returns
llama_model_default_params()
Create a LLamaModelParams with default values
Returns
llama_context_default_params()
Create a LLamaContextParams with default values
Returns
llama_model_quantize_default_params()
Create a LLamaModelQuantizeParams with default values
Returns
llama_supports_mmap()
Check if memory mapping is supported
Returns
llama_supports_mlock()
Check if memory locking is supported
Returns
llama_supports_gpu_offload()
Check if GPU offload is supported
Returns
llama_set_rng_seed(SafeLLamaContextHandle, UInt32)
Sets the current rng seed.
Parameters
seed UInt32
llama_get_state_size(SafeLLamaContextHandle)
Returns the maximum size in bytes of the state (rng, logits, embedding and kv_cache) - will often be smaller after compacting tokens
Parameters
Returns
llama_copy_state_data(SafeLLamaContextHandle, Byte*)
Copies the state to the specified destination address. Destination needs to have allocated enough memory.
Parameters
dest Byte*
Returns
UInt64
the number of bytes copied
llama_set_state_data(SafeLLamaContextHandle, Byte*)
Set the state reading from the specified address
Parameters
src Byte*
Returns
UInt64
the number of bytes read
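A hedged sketch of a save/restore round trip built on llama_get_state_size, llama_copy_state_data and llama_set_state_data, assuming unsafe code is enabled and that the size values are UInt64 (as in current LLamaSharp); the helper names are hypothetical:

```csharp
// Hedged sketch: assumes `using System;`, `using LLama.Native;` and unsafe code enabled.
static unsafe byte[] SaveState(SafeLLamaContextHandle ctx)
{
    // llama_get_state_size is an upper bound; llama_copy_state_data reports the bytes actually written.
    var buffer = new byte[checked((int)NativeApi.llama_get_state_size(ctx))];

    ulong written;
    fixed (byte* dst = buffer)
    {
        written = NativeApi.llama_copy_state_data(ctx, dst);
    }

    Array.Resize(ref buffer, checked((int)written));
    return buffer;
}

static unsafe void LoadState(SafeLLamaContextHandle ctx, byte[] state)
{
    fixed (byte* src = state)
    {
        NativeApi.llama_set_state_data(ctx, src);
    }
}
```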
llama_load_session_file(SafeLLamaContextHandle, String, LLamaToken[], UInt64, UInt64&)
Load session file
Parameters
path_session String
tokens_out LLamaToken[]
n_token_capacity UInt64
n_token_count_out UInt64&
Returns
llama_save_session_file(SafeLLamaContextHandle, String, LLamaToken[], UInt64)
Save session file
Parameters
path_session String
tokens LLamaToken[]
n_token_count UInt64
Returns
llama_token_get_text(SafeLlamaModelHandle, LLamaToken)
Parameters
model SafeLlamaModelHandle
token LLamaToken
Returns
llama_token_get_score(SafeLlamaModelHandle, LLamaToken)
Parameters
model SafeLlamaModelHandle
token LLamaToken
Returns
llama_token_get_type(SafeLlamaModelHandle, LLamaToken)
Parameters
model SafeLlamaModelHandle
token LLamaToken
Returns
llama_n_ctx(SafeLLamaContextHandle)
Get the size of the context window for the model for this context
Parameters
Returns
llama_n_batch(SafeLLamaContextHandle)
Get the batch size for this context
Parameters
Returns
llama_get_logits(SafeLLamaContextHandle)
Token logits obtained from the last call to llama_decode
The logits for the last token are stored in the last row
Can be mutated in order to change the probabilities of the next token.
Rows: n_tokens
Cols: n_vocab
Parameters
Returns
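A hedged sketch of reading one logit after a single-token decode, assuming llama_get_logits returns a raw Single* and unsafe code is enabled; the helper name is hypothetical:

```csharp
// Hedged sketch: assumes `using LLama.Native;` and unsafe code enabled.
static unsafe float ReadLogit(SafeLLamaContextHandle ctx, int vocabId)
{
    // After a single-token decode there is exactly one row of n_vocab logits,
    // so the logit for a given vocabulary id can be read by direct indexing.
    float* logits = NativeApi.llama_get_logits(ctx);
    return logits[vocabId];
}
```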
llama_get_logits_ith(SafeLLamaContextHandle, Int32)
Logits for the ith token. Equivalent to: llama_get_logits(ctx) + i*n_vocab
Parameters
i Int32
Returns
llama_get_embeddings_ith(SafeLLamaContextHandle, Int32)
Get the embeddings for the ith sequence. Equivalent to: llama_get_embeddings(ctx) + i*n_embd
Parameters
i Int32