NativeApi
Namespace: LLama.Native
Direct translation of the llama.cpp API
public class NativeApi
Inheritance Object → NativeApi
Constructors
NativeApi()
public NativeApi()
Methods
llama_sample_token_mirostat(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, Single, Int32, Single&)
Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
public static int llama_sample_token_mirostat(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float tau, float eta, int m, Single& mu)
Parameters
candidates
LLamaTokenDataArrayNative&
A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
tau
Single
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
eta
Single
The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.
m
Int32
The number of tokens considered in the estimation of s_hat. This is an arbitrary value that is used to calculate s_hat, which in turn helps to calculate the value of k. In the paper, they use m = 100, but you can experiment with different values to see how it affects the performance of the algorithm.
mu
Single&
Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.
Returns
Int32
The sampled token id.
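A minimal usage sketch, assuming `ctx` and `candidates` have already been prepared from the current logits; note that `mu` must persist across sampling steps and is conventionally initialized to 2 * tau, per the parameter description above:

```csharp
// Sketch only: `ctx` and `candidates` are assumed to exist elsewhere.
float tau = 5.0f;        // target cross-entropy (surprise)
float eta = 0.1f;        // learning rate
float mu = 2.0f * tau;   // conventional initial value, per the mu docs above

// `mu` is passed by reference and updated in place, so the same variable
// should be carried across successive sampling steps.
int token = NativeApi.llama_sample_token_mirostat(ctx, ref candidates, tau, eta, 100, ref mu);
```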
llama_sample_token_mirostat_v2(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, Single, Single&)
Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
public static int llama_sample_token_mirostat_v2(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float tau, float eta, Single& mu)
Parameters
candidates
LLamaTokenDataArrayNative&
A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
tau
Single
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
eta
Single
The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.
mu
Single&
Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.
Returns
Int32
The sampled token id.
llama_sample_token_greedy(SafeLLamaContextHandle, LLamaTokenDataArrayNative&)
Selects the token with the highest probability.
public static int llama_sample_token_greedy(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
Returns
Int32
The selected token id.
llama_sample_token(SafeLLamaContextHandle, LLamaTokenDataArrayNative&)
Randomly selects a token from the candidates based on their probabilities.
public static int llama_sample_token(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
Returns
Int32
The selected token id.
llama_token_to_str(SafeLLamaContextHandle, Int32)
Token Id -> String. Uses the vocabulary in the provided context
public static IntPtr llama_token_to_str(SafeLLamaContextHandle ctx, int token)
Parameters
token
Int32
Returns
IntPtr
Pointer to a string.
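The returned pointer is owned by the native library, so it should be copied into a managed string rather than held. A sketch, assuming `ctx` and `token` exist (llama.cpp text is UTF-8; `Marshal.PtrToStringUTF8` requires .NET Core 3.0+):

```csharp
// Sketch: copy the native string before the pointer is reused.
// Note: a single token may decode to a partial UTF-8 sequence.
IntPtr ptr = NativeApi.llama_token_to_str(ctx, token);
string? piece = System.Runtime.InteropServices.Marshal.PtrToStringUTF8(ptr);
```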
llama_token_bos(SafeLLamaContextHandle)
Get the "Beginning of sentence" token
public static int llama_token_bos(SafeLLamaContextHandle ctx)
Parameters
Returns
Int32
The token id.
llama_token_eos(SafeLLamaContextHandle)
Get the "End of sentence" token
public static int llama_token_eos(SafeLLamaContextHandle ctx)
Parameters
Returns
Int32
The token id.
llama_token_nl(SafeLLamaContextHandle)
Get the "new line" token
public static int llama_token_nl(SafeLLamaContextHandle ctx)
Parameters
Returns
Int32
The token id.
llama_print_timings(SafeLLamaContextHandle)
Print out timing information for this context
public static void llama_print_timings(SafeLLamaContextHandle ctx)
Parameters
llama_reset_timings(SafeLLamaContextHandle)
Reset all collected timing information for this context
public static void llama_reset_timings(SafeLLamaContextHandle ctx)
Parameters
llama_print_system_info()
Print system information
public static IntPtr llama_print_system_info()
Returns
IntPtr
Pointer to a null-terminated string of system information.
llama_model_n_vocab(SafeLlamaModelHandle)
Get the number of tokens in the model vocabulary
public static int llama_model_n_vocab(SafeLlamaModelHandle model)
Parameters
model
SafeLlamaModelHandle
Returns
Int32
The number of tokens in the vocabulary.
llama_model_n_ctx(SafeLlamaModelHandle)
Get the size of the context window for the model
public static int llama_model_n_ctx(SafeLlamaModelHandle model)
Parameters
model
SafeLlamaModelHandle
Returns
Int32
The size of the context window in tokens.
llama_model_n_embd(SafeLlamaModelHandle)
Get the dimension of embedding vectors from this model
public static int llama_model_n_embd(SafeLlamaModelHandle model)
Parameters
model
SafeLlamaModelHandle
Returns
Int32
The embedding dimension.
llama_token_to_piece_with_model(SafeLlamaModelHandle, Int32, Byte*, Int32)
Convert a single token into text
public static int llama_token_to_piece_with_model(SafeLlamaModelHandle model, int llamaToken, Byte* buffer, int length)
Parameters
model
SafeLlamaModelHandle
llamaToken
Int32
buffer
Byte*
buffer to write string into
length
Int32
size of the buffer
Returns
Int32
The length written, or if the buffer is too small a negative value that indicates the required length
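A sketch of the resize-and-retry pattern this return convention implies; `model` and `token` are placeholder names for a loaded model handle and a token id:

```csharp
// Sketch: call once with a small buffer, then retry with the exact size
// if the native side reports (as a negated count) that more is needed.
unsafe
{
    var buffer = new byte[16];
    fixed (byte* ptr = buffer)
    {
        int n = NativeApi.llama_token_to_piece_with_model(model, token, ptr, buffer.Length);
        if (n < 0)
        {
            // Negative result: -n is the required buffer length, so retry.
            var bigger = new byte[-n];
            fixed (byte* p2 = bigger)
                n = NativeApi.llama_token_to_piece_with_model(model, token, p2, bigger.Length);
        }
        // The first `n` bytes now hold the UTF-8 text of the token.
    }
}
```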
llama_tokenize_with_model(SafeLlamaModelHandle, Byte*, Int32*, Int32, Boolean)
Convert text into tokens
public static int llama_tokenize_with_model(SafeLlamaModelHandle model, Byte* text, Int32* tokens, int n_max_tokens, bool add_bos)
Parameters
model
SafeLlamaModelHandle
text
Byte*
tokens
Int32*
n_max_tokens
Int32
add_bos
Boolean
Returns
Int32
Returns the number of tokens on success, no more than n_max_tokens.
Returns a negative number on failure - the number of tokens that would have been returned
llama_log_set(LLamaLogCallback)
Register a callback to receive llama log messages
public static void llama_log_set(LLamaLogCallback logCallback)
Parameters
logCallback
LLamaLogCallback
llama_grammar_init(LLamaGrammarElement**, UInt64, UInt64)
Create a new grammar from the given set of grammar rules
public static IntPtr llama_grammar_init(LLamaGrammarElement** rules, ulong n_rules, ulong start_rule_index)
Parameters
rules
LLamaGrammarElement**
n_rules
UInt64
start_rule_index
UInt64
Returns
IntPtr
Pointer to the newly created grammar.
llama_grammar_free(IntPtr)
Free all memory from the given SafeLLamaGrammarHandle
public static void llama_grammar_free(IntPtr grammar)
Parameters
grammar
IntPtr
llama_sample_grammar(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, SafeLLamaGrammarHandle)
Apply constraints from grammar
public static void llama_sample_grammar(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, SafeLLamaGrammarHandle grammar)
Parameters
candidates
LLamaTokenDataArrayNative&
grammar
SafeLLamaGrammarHandle
llama_grammar_accept_token(SafeLLamaContextHandle, SafeLLamaGrammarHandle, Int32)
Accepts the sampled token into the grammar
public static void llama_grammar_accept_token(SafeLLamaContextHandle ctx, SafeLLamaGrammarHandle grammar, int token)
Parameters
grammar
SafeLLamaGrammarHandle
token
Int32
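A sketch of one grammar-constrained sampling step, combining llama_sample_grammar and llama_grammar_accept_token; `ctx`, `grammar`, and a prepared `candidates` array are assumed to exist:

```csharp
// Sketch: constrain the candidates, sample, then feed the chosen
// token back into the grammar state for the next step.
NativeApi.llama_sample_grammar(ctx, ref candidates, grammar);   // mask out tokens the grammar forbids
int token = NativeApi.llama_sample_token(ctx, ref candidates);  // sample from what remains
NativeApi.llama_grammar_accept_token(ctx, grammar, token);      // advance the grammar state
```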
llama_model_quantize(String, String, LLamaModelQuantizeParams*)
Returns 0 on success
public static int llama_model_quantize(string fname_inp, string fname_out, LLamaModelQuantizeParams* param)
Parameters
fname_inp
String
fname_out
String
param
LLamaModelQuantizeParams*
Returns
Int32
Returns 0 on success
Remarks:
not great API - very likely to change
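A sketch of a quantization call using the default parameters; the file names are illustrative:

```csharp
// Sketch: quantize a model file with default parameters.
unsafe
{
    var p = NativeApi.llama_model_quantize_default_params();
    int result = NativeApi.llama_model_quantize("model-f16.bin", "model-q4_0.bin", &p);
    if (result != 0)
        throw new Exception("Quantization failed");
}
```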
llama_sample_classifier_free_guidance(SafeLLamaContextHandle, LLamaTokenDataArrayNative, SafeLLamaContextHandle, Single)
Apply classifier-free guidance to the logits as described in academic paper "Stay on topic with Classifier-Free Guidance" https://arxiv.org/abs/2306.17806
public static void llama_sample_classifier_free_guidance(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative candidates, SafeLLamaContextHandle guidanceCtx, float scale)
Parameters
candidates
LLamaTokenDataArrayNative
A vector of llama_token_data containing the candidate tokens; the logits must be directly extracted from the original generation context without being sorted.
guidanceCtx
SafeLLamaContextHandle
A separate context from the same model. Other than a negative prompt at the beginning, it should have all generated and user input tokens copied from the main context.
scale
Single
Guidance strength. 1.0f means no guidance. Higher values mean stronger guidance.
llama_sample_repetition_penalty(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Int32*, UInt64, Single)
Repetition penalty described in CTRL academic paper https://arxiv.org/abs/1909.05858, with negative logit fix.
public static void llama_sample_repetition_penalty(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, Int32* last_tokens, ulong last_tokens_size, float penalty)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
last_tokens
Int32*
last_tokens_size
UInt64
penalty
Single
llama_sample_frequency_and_presence_penalties(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Int32*, UInt64, Single, Single)
Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
public static void llama_sample_frequency_and_presence_penalties(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, Int32* last_tokens, ulong last_tokens_size, float alpha_frequency, float alpha_presence)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
last_tokens
Int32*
last_tokens_size
UInt64
alpha_frequency
Single
alpha_presence
Single
llama_sample_classifier_free_guidance(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, SafeLLamaContextHandle, Single)
Apply classifier-free guidance to the logits as described in academic paper "Stay on topic with Classifier-Free Guidance" https://arxiv.org/abs/2306.17806
public static void llama_sample_classifier_free_guidance(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, SafeLLamaContextHandle guidance_ctx, float scale)
Parameters
candidates
LLamaTokenDataArrayNative&
A vector of llama_token_data containing the candidate tokens; the logits must be directly extracted from the original generation context without being sorted.
guidance_ctx
SafeLLamaContextHandle
A separate context from the same model. Other than a negative prompt at the beginning, it should have all generated and user input tokens copied from the main context.
scale
Single
Guidance strength. 1.0f means no guidance. Higher values mean stronger guidance.
llama_sample_softmax(SafeLLamaContextHandle, LLamaTokenDataArrayNative&)
Sorts candidate tokens by their logits in descending order and calculates probabilities based on the logits.
public static void llama_sample_softmax(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
llama_sample_top_k(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Int32, UInt64)
Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
public static void llama_sample_top_k(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, int k, ulong min_keep)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
k
Int32
min_keep
UInt64
llama_sample_top_p(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)
Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
public static void llama_sample_top_p(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float p, ulong min_keep)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
p
Single
min_keep
UInt64
llama_sample_tail_free(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)
Tail Free Sampling described in https://www.trentonbricken.com/Tail-Free-Sampling/.
public static void llama_sample_tail_free(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float z, ulong min_keep)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
z
Single
min_keep
UInt64
llama_sample_typical(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single, UInt64)
Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
public static void llama_sample_typical(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float p, ulong min_keep)
Parameters
candidates
LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
p
Single
min_keep
UInt64
llama_sample_temperature(SafeLLamaContextHandle, LLamaTokenDataArrayNative&, Single)
Modify logits by temperature
public static void llama_sample_temperature(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float temp)
Parameters
candidates
LLamaTokenDataArrayNative&
temp
Single
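Taken together, the samplers above are typically chained before the final draw. A sketch of a common ordering (penalties, then truncation, then temperature, then sampling); `ctx`, `candidates`, and `recentTokens` (the most recently generated token ids) are assumed, and all tuning values are illustrative:

```csharp
// Sketch of a typical sampling chain over one candidates array.
unsafe
{
    fixed (int* last = recentTokens)
    {
        NativeApi.llama_sample_repetition_penalty(
            ctx, ref candidates, last, (ulong)recentTokens.Length, 1.1f);
        NativeApi.llama_sample_frequency_and_presence_penalties(
            ctx, ref candidates, last, (ulong)recentTokens.Length, 0.0f, 0.0f);
    }
    NativeApi.llama_sample_top_k(ctx, ref candidates, 40, 1);
    NativeApi.llama_sample_top_p(ctx, ref candidates, 0.95f, 1);
    NativeApi.llama_sample_temperature(ctx, ref candidates, 0.8f);
    int token = NativeApi.llama_sample_token(ctx, ref candidates);
}
```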
llama_empty_call()
A method that does nothing. This is a native method, calling it will force the llama native dependencies to be loaded.
public static bool llama_empty_call()
Returns
Boolean
llama_context_default_params()
Create a LLamaContextParams with default values
public static LLamaContextParams llama_context_default_params()
Returns
LLamaContextParams
A parameter struct populated with default values.
llama_model_quantize_default_params()
Create a LLamaModelQuantizeParams with default values
public static LLamaModelQuantizeParams llama_model_quantize_default_params()
Returns
LLamaModelQuantizeParams
A parameter struct populated with default values.
llama_mmap_supported()
Check if memory mapping is supported
public static bool llama_mmap_supported()
Returns
Boolean
Whether memory mapping is supported.
llama_mlock_supported()
Check if memory locking is supported
public static bool llama_mlock_supported()
Returns
Boolean
Whether memory locking is supported.
llama_eval_export(SafeLLamaContextHandle, String)
Export a static computation graph for context of 511 and batch size of 1. NOTE: since this functionality is mostly for debugging and demonstration purposes, we hardcode these parameters here to keep things simple. IMPORTANT: do not use for anything else other than debugging and testing!
public static int llama_eval_export(SafeLLamaContextHandle ctx, string fname)
Parameters
fname
String
Returns
Int32
llama_load_model_from_file(String, LLamaContextParams)
Load a ggml llama model from a file. Allocates (almost) all memory needed for the model. Returns NULL on failure.
public static IntPtr llama_load_model_from_file(string path_model, LLamaContextParams params)
Parameters
path_model
String
params
LLamaContextParams
Returns
IntPtr
Pointer to the loaded model, or NULL on failure.
llama_new_context_with_model(SafeLlamaModelHandle, LLamaContextParams)
Create a new llama_context with the given model. Return value should always be wrapped in SafeLLamaContextHandle!
public static IntPtr llama_new_context_with_model(SafeLlamaModelHandle model, LLamaContextParams params)
Parameters
model
SafeLlamaModelHandle
params
LLamaContextParams
Returns
IntPtr
Pointer to the new llama_context.
llama_backend_init(Boolean)
Initialize the llama + ggml backend. Call once at the start of the program. NOTE: not great API - very likely to change.
public static void llama_backend_init(bool numa)
Parameters
numa
Boolean
llama_free(IntPtr)
Frees all allocated memory in the given llama_context
public static void llama_free(IntPtr ctx)
Parameters
ctx
IntPtr
llama_free_model(IntPtr)
Frees all allocated memory associated with a model
public static void llama_free_model(IntPtr model)
Parameters
model
IntPtr
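A sketch of the load/free lifecycle the functions above form; the model path is illustrative and error handling is minimal:

```csharp
// Sketch: one-time backend init, then model + context creation and teardown.
NativeApi.llama_backend_init(numa: false);

var contextParams = NativeApi.llama_context_default_params();
IntPtr model = NativeApi.llama_load_model_from_file("model.bin", contextParams);
if (model == IntPtr.Zero)
    throw new Exception("Failed to load model");

// Wrap `model` in a SafeLlamaModelHandle (construction not shown) before
// calling llama_new_context_with_model, as the docs advise:
// IntPtr ctx = NativeApi.llama_new_context_with_model(modelHandle, contextParams);

// Free in reverse order of creation when finished:
// NativeApi.llama_free(ctx);
// NativeApi.llama_free_model(model);
```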
llama_model_apply_lora_from_file(SafeLlamaModelHandle, String, String, Int32)
Apply a LoRA adapter to a loaded model. path_base_model is the path to a higher quality model to use as a base for the layers modified by the adapter; it can be NULL to use the current loaded model. The model needs to be reloaded before applying a new adapter, otherwise the adapter will be applied on top of the previous one.
public static int llama_model_apply_lora_from_file(SafeLlamaModelHandle model_ptr, string path_lora, string path_base_model, int n_threads)
Parameters
model_ptr
SafeLlamaModelHandle
path_lora
String
path_base_model
String
n_threads
Int32
Returns
Int32
Returns 0 on success
llama_get_kv_cache_token_count(SafeLLamaContextHandle)
Returns the number of tokens in the KV cache
public static int llama_get_kv_cache_token_count(SafeLLamaContextHandle ctx)
Parameters
Returns
Int32
The number of tokens currently in the KV cache.
llama_set_rng_seed(SafeLLamaContextHandle, Int32)
Sets the current rng seed.
public static void llama_set_rng_seed(SafeLLamaContextHandle ctx, int seed)
Parameters
seed
Int32
llama_get_state_size(SafeLLamaContextHandle)
Returns the maximum size in bytes of the state (rng, logits, embedding and kv_cache) - will often be smaller after compacting tokens
public static ulong llama_get_state_size(SafeLLamaContextHandle ctx)
Parameters
Returns
UInt64
The maximum state size in bytes.
llama_copy_state_data(SafeLLamaContextHandle, Byte*)
Copies the state to the specified destination address. Destination needs to have allocated enough memory.
public static ulong llama_copy_state_data(SafeLLamaContextHandle ctx, Byte* dest)
Parameters
dest
Byte*
Returns
UInt64
the number of bytes copied
llama_copy_state_data(SafeLLamaContextHandle, Byte[])
Copies the state to the specified destination address. Destination needs to have allocated enough memory (see llama_get_state_size)
public static ulong llama_copy_state_data(SafeLLamaContextHandle ctx, Byte[] dest)
Parameters
dest
Byte[]
Returns
UInt64
the number of bytes copied
llama_set_state_data(SafeLLamaContextHandle, Byte*)
Set the state reading from the specified address
public static ulong llama_set_state_data(SafeLLamaContextHandle ctx, Byte* src)
Parameters
src
Byte*
Returns
UInt64
the number of bytes read
llama_set_state_data(SafeLLamaContextHandle, Byte[])
Set the state reading from the specified address
public static ulong llama_set_state_data(SafeLLamaContextHandle ctx, Byte[] src)
Parameters
src
Byte[]
Returns
UInt64
the number of bytes read
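A sketch of snapshot and restore through a managed buffer, using the Byte[] overloads above; `ctx` is assumed to be a valid context handle:

```csharp
// Sketch: size the buffer, snapshot the state, restore it later.
ulong size = NativeApi.llama_get_state_size(ctx);
var state = new byte[size];
ulong copied = NativeApi.llama_copy_state_data(ctx, state);

// ... later, restore the snapshot (the context must use the same model):
ulong read = NativeApi.llama_set_state_data(ctx, state);
```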
llama_load_session_file(SafeLLamaContextHandle, String, Int32[], UInt64, UInt64*)
Load session file
public static bool llama_load_session_file(SafeLLamaContextHandle ctx, string path_session, Int32[] tokens_out, ulong n_token_capacity, UInt64* n_token_count_out)
Parameters
path_session
String
tokens_out
Int32[]
n_token_capacity
UInt64
n_token_count_out
UInt64*
Returns
Boolean
Whether the session file was loaded successfully.
llama_save_session_file(SafeLLamaContextHandle, String, Int32[], UInt64)
Save session file
public static bool llama_save_session_file(SafeLLamaContextHandle ctx, string path_session, Int32[] tokens, ulong n_token_count)
Parameters
path_session
String
tokens
Int32[]
n_token_count
UInt64
Returns
Boolean
Whether the session file was saved successfully.
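A sketch of saving and reloading a session; `ctx` and `tokens` (the tokens evaluated so far) are assumed, and the path is illustrative:

```csharp
// Sketch: persist the current tokens + KV state to disk and reload them.
unsafe
{
    NativeApi.llama_save_session_file(ctx, "chat.session", tokens, (ulong)tokens.Length);

    var restored = new int[2048];   // capacity for the reloaded tokens
    ulong count = 0;
    if (NativeApi.llama_load_session_file(ctx, "chat.session", restored, (ulong)restored.Length, &count))
    {
        // `count` tokens were read back into `restored`.
    }
}
```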
llama_eval(SafeLLamaContextHandle, Int32[], Int32, Int32, Int32)
Run the llama inference to obtain the logits and probabilities for the next token. tokens + n_tokens is the provided batch of new tokens to process. n_past is the number of tokens to use from previous eval calls.
public static int llama_eval(SafeLLamaContextHandle ctx, Int32[] tokens, int n_tokens, int n_past, int n_threads)
Parameters
tokens
Int32[]
n_tokens
Int32
n_past
Int32
n_threads
Int32
Returns
Int32
Returns 0 on success
llama_eval_with_pointer(SafeLLamaContextHandle, Int32*, Int32, Int32, Int32)
Run the llama inference to obtain the logits and probabilities for the next token. tokens + n_tokens is the provided batch of new tokens to process. n_past is the number of tokens to use from previous eval calls.
public static int llama_eval_with_pointer(SafeLLamaContextHandle ctx, Int32* tokens, int n_tokens, int n_past, int n_threads)
Parameters
tokens
Int32*
n_tokens
Int32
n_past
Int32
n_threads
Int32
Returns
Int32
Returns 0 on success
llama_tokenize(SafeLLamaContextHandle, String, Encoding, Int32[], Int32, Boolean)
Convert the provided text into tokens.
public static int llama_tokenize(SafeLLamaContextHandle ctx, string text, Encoding encoding, Int32[] tokens, int n_max_tokens, bool add_bos)
Parameters
text
String
encoding
Encoding
tokens
Int32[]
n_max_tokens
Int32
add_bos
Boolean
Returns
Int32
Returns the number of tokens on success, no more than n_max_tokens.
Returns a negative number on failure - the number of tokens that would have been returned
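A sketch combining llama_tokenize with llama_eval (described above) to process a prompt in one batch; `ctx` is assumed, and the buffer size and thread count are illustrative:

```csharp
// Sketch: tokenize a prompt, then evaluate the whole batch.
var tokens = new int[512];
int n = NativeApi.llama_tokenize(ctx, "Hello, world!", System.Text.Encoding.UTF8,
                                 tokens, tokens.Length, add_bos: true);
if (n < 0)
    throw new Exception($"Token buffer too small, {-n} tokens required");

if (NativeApi.llama_eval(ctx, tokens, n, n_past: 0, n_threads: 4) != 0)
    throw new Exception("llama_eval failed");
```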
llama_tokenize_native(SafeLLamaContextHandle, Byte*, Int32*, Int32, Boolean)
Convert the provided text into tokens.
public static int llama_tokenize_native(SafeLLamaContextHandle ctx, Byte* text, Int32* tokens, int n_max_tokens, bool add_bos)
Parameters
text
Byte*
tokens
Int32*
n_max_tokens
Int32
add_bos
Boolean
Returns
Int32
Returns the number of tokens on success, no more than n_max_tokens.
Returns a negative number on failure - the number of tokens that would have been returned
llama_n_vocab(SafeLLamaContextHandle)
Get the number of tokens in the model vocabulary for this context
public static int llama_n_vocab(SafeLLamaContextHandle ctx)
Parameters
Returns
Int32
The number of tokens in the vocabulary.
llama_n_ctx(SafeLLamaContextHandle)
Get the size of the context window for the model for this context
public static int llama_n_ctx(SafeLLamaContextHandle ctx)
Parameters
Returns
Int32
The size of the context window in tokens.
llama_n_embd(SafeLLamaContextHandle)
Get the dimension of embedding vectors from the model for this context
public static int llama_n_embd(SafeLLamaContextHandle ctx)
Parameters
Returns
Int32
The embedding dimension.
llama_get_logits(SafeLLamaContextHandle)
Token logits obtained from the last call to llama_eval()
The logits for the last token are stored in the last row
Can be mutated in order to change the probabilities of the next token.
Rows: n_tokens
Cols: n_vocab
public static Single* llama_get_logits(SafeLLamaContextHandle ctx)
Parameters
Returns
Single*
Pointer to the logits.
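A sketch of reading the logits produced by the last llama_eval call and picking the argmax (equivalent to greedy sampling); `ctx` is assumed:

```csharp
// Sketch: read the last token's logits and find the most likely token.
unsafe
{
    int nVocab = NativeApi.llama_n_vocab(ctx);
    float* logits = NativeApi.llama_get_logits(ctx);

    // With the default settings only the last token's row is exposed,
    // so the buffer can be indexed directly by token id.
    int best = 0;
    for (int i = 1; i < nVocab; i++)
        if (logits[i] > logits[best])
            best = i;
    // `best` is now the greedy choice for the next token.
}
```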
llama_get_embeddings(SafeLLamaContextHandle)
Get the embeddings for the input. Shape: [n_embd] (1-dimensional).
public static Single* llama_get_embeddings(SafeLLamaContextHandle ctx)