

NativeApi

Namespace: LLama.Native

Direct translation of the llama.cpp API

public static class NativeApi

Inheritance Object → NativeApi
Attributes NullableContextAttribute, NullableAttribute

Methods

llama_empty_call()

A method that does nothing. This is a native method; calling it forces the llama native dependencies to be loaded.

public static void llama_empty_call()

llama_backend_free()

Call once at the end of the program - currently only used for MPI

public static void llama_backend_free()
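
A minimal sketch of forcing the native libraries to load at startup and releasing the backend at shutdown. The surrounding Program class is illustrative only; just the two NativeApi calls come from this page.

```csharp
using LLama.Native;

internal static class Program
{
    private static void Main()
    {
        // The no-op native call forces the llama.cpp native dependencies to be
        // resolved and loaded up-front, so a load failure surfaces here rather
        // than in the middle of inference.
        NativeApi.llama_empty_call();

        // ... load models, create contexts, run inference ...

        // Call once at the end of the program (currently only relevant for MPI).
        NativeApi.llama_backend_free();
    }
}
```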

llama_max_devices()

Get the maximum number of devices supported by llama.cpp

public static long llama_max_devices()

Returns

Int64

llama_supports_mmap()

Check if memory mapping is supported

public static bool llama_supports_mmap()

Returns

Boolean

llama_supports_mlock()

Check if memory locking is supported

public static bool llama_supports_mlock()

Returns

Boolean

llama_supports_gpu_offload()

Check if GPU offload is supported

public static bool llama_supports_gpu_offload()

Returns

Boolean

llama_supports_rpc()

Check if RPC offload is supported

public static bool llama_supports_rpc()

Returns

Boolean
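
A small sketch that queries the capability helpers above before deciding how to configure a model; the console output is purely illustrative.

```csharp
using System;
using LLama.Native;

long maxDevices = NativeApi.llama_max_devices();
Console.WriteLine($"Max devices: {maxDevices}");
Console.WriteLine($"mmap:        {NativeApi.llama_supports_mmap()}");
Console.WriteLine($"mlock:       {NativeApi.llama_supports_mlock()}");
Console.WriteLine($"GPU offload: {NativeApi.llama_supports_gpu_offload()}");
Console.WriteLine($"RPC:         {NativeApi.llama_supports_rpc()}");
```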

llama_state_load_file(SafeLLamaContextHandle, String, LLamaToken[], UInt64, UInt64&)

Load session file

public static bool llama_state_load_file(SafeLLamaContextHandle ctx, string path_session, LLamaToken[] tokens_out, ulong n_token_capacity, UInt64& n_token_count_out)

Parameters

ctx SafeLLamaContextHandle

path_session String

tokens_out LLamaToken[]

n_token_capacity UInt64

n_token_count_out UInt64&

Returns

Boolean

llama_state_save_file(SafeLLamaContextHandle, String, LLamaToken[], UInt64)

Save session file

public static bool llama_state_save_file(SafeLLamaContextHandle ctx, string path_session, LLamaToken[] tokens, ulong n_token_count)

Parameters

ctx SafeLLamaContextHandle

path_session String

tokens LLamaToken[]

n_token_count UInt64

Returns

Boolean
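
A hedged sketch of saving and restoring a whole session. It assumes ctx is an existing SafeLLamaContextHandle whose evaluated prompt tokens are available as a LLamaToken[], that the by-ref count parameter binds as out, and that the file name and capacity shown are illustrative.

```csharp
using System;
using LLama.Native;

static void SaveAndRestoreSession(SafeLLamaContextHandle ctx, LLamaToken[] tokens)
{
    // Save the current session (KV cache state plus token history) to disk.
    if (!NativeApi.llama_state_save_file(ctx, "session.bin", tokens, (ulong)tokens.Length))
        return;

    // Later: load it back into a compatible context.
    var restored = new LLamaToken[4096];    // capacity for the restored token history
    bool ok = NativeApi.llama_state_load_file(
        ctx, "session.bin", restored, (ulong)restored.Length, out ulong count);
    if (ok)
        Console.WriteLine($"Restored {count} tokens from session.bin");
}
```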

llama_state_seq_save_file(SafeLLamaContextHandle, String, LLamaSeqId, LLamaToken*, UIntPtr)

Saves the specified sequence as a file at the specified filepath. Can later be loaded via NativeApi.llama_state_seq_load_file(SafeLLamaContextHandle, String, LLamaSeqId, LLamaToken*, UIntPtr, UIntPtr&)

public static UIntPtr llama_state_seq_save_file(SafeLLamaContextHandle ctx, string filepath, LLamaSeqId seq_id, LLamaToken* tokens, UIntPtr n_token_count)

Parameters

ctx SafeLLamaContextHandle

filepath String

seq_id LLamaSeqId

tokens LLamaToken*

n_token_count UIntPtr

Returns

UIntPtr

llama_state_seq_load_file(SafeLLamaContextHandle, String, LLamaSeqId, LLamaToken*, UIntPtr, UIntPtr&)

Loads a sequence saved as a file via NativeApi.llama_state_seq_save_file(SafeLLamaContextHandle, String, LLamaSeqId, LLamaToken*, UIntPtr) into the specified sequence

public static UIntPtr llama_state_seq_load_file(SafeLLamaContextHandle ctx, string filepath, LLamaSeqId dest_seq_id, LLamaToken* tokens_out, UIntPtr n_token_capacity, UIntPtr& n_token_count_out)

Parameters

ctx SafeLLamaContextHandle

filepath String

dest_seq_id LLamaSeqId

tokens_out LLamaToken*

n_token_capacity UIntPtr

n_token_count_out UIntPtr&

Returns

UIntPtr
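
A hedged sketch of the pointer-based per-sequence variants. It pins managed LLamaToken arrays to obtain the required pointers, uses default sequence ids (sequence 0), and assumes the by-ref count parameter binds as out; file name and capacity are illustrative.

```csharp
using System;
using LLama.Native;

static unsafe void SaveAndLoadSequence(SafeLLamaContextHandle ctx, LLamaToken[] tokens)
{
    LLamaSeqId seq = default;   // sequence 0 in this sketch

    fixed (LLamaToken* tokenPtr = tokens)
    {
        // Save one sequence (its KV cells and token history) to a file.
        UIntPtr written = NativeApi.llama_state_seq_save_file(
            ctx, "seq0.bin", seq, tokenPtr, (UIntPtr)(uint)tokens.Length);
        Console.WriteLine($"Wrote {written} bytes");
    }

    // Load it back into the destination sequence (sequence 0 again here).
    var restored = new LLamaToken[4096];
    fixed (LLamaToken* restoredPtr = restored)
    {
        UIntPtr read = NativeApi.llama_state_seq_load_file(
            ctx, "seq0.bin", seq, restoredPtr, (UIntPtr)(uint)restored.Length, out UIntPtr count);
        Console.WriteLine($"Read {read} bytes, restored {count} tokens");
    }
}
```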

llama_set_causal_attn(SafeLLamaContextHandle, Boolean)

Set whether to use causal attention or not. If set to true, the model will only attend to the past tokens

public static void llama_set_causal_attn(SafeLLamaContextHandle ctx, bool causalAttn)

Parameters

ctx SafeLLamaContextHandle

causalAttn Boolean

llama_set_embeddings(SafeLLamaContextHandle, Boolean)

Set whether the model is in embeddings mode or not.

public static void llama_set_embeddings(SafeLLamaContextHandle ctx, bool embeddings)

Parameters

ctx SafeLLamaContextHandle

embeddings Boolean
If true, embeddings will be returned but logits will not
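
A short sketch of switching an existing context into embeddings mode around an embedding pass and back again; ctx is assumed to already exist, and whether to disable causal attention depends on the model (typical for dedicated embedding models).

```csharp
using LLama.Native;

static void RunEmbeddingPass(SafeLLamaContextHandle ctx)
{
    // Return embeddings (and no logits) from subsequent decode calls.
    NativeApi.llama_set_embeddings(ctx, true);

    // Dedicated embedding models are typically run with non-causal attention;
    // generative models normally keep causal attention enabled.
    NativeApi.llama_set_causal_attn(ctx, false);

    // ... decode the batch and read the embeddings here ...

    // Restore normal generative behaviour afterwards.
    NativeApi.llama_set_causal_attn(ctx, true);
    NativeApi.llama_set_embeddings(ctx, false);
}
```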

llama_set_abort_callback(SafeLlamaModelHandle, IntPtr, IntPtr)

Set abort callback

public static void llama_set_abort_callback(SafeLlamaModelHandle ctx, IntPtr abortCallback, IntPtr abortCallbackData)

Parameters

ctx SafeLlamaModelHandle

abortCallback IntPtr

abortCallbackData IntPtr

llama_n_seq_max(SafeLLamaContextHandle)

Get the n_seq_max for this context

public static uint llama_n_seq_max(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

UInt32

llama_get_embeddings(SafeLLamaContextHandle)

Get all output token embeddings. When pooling_type == LLAMA_POOLING_TYPE_NONE or when using a generative model, the embeddings for which llama_batch.logits[i] != 0 are stored contiguously in the order they appeared in the batch (shape: [n_outputs * n_embd]). Otherwise, returns an empty span.

public static Single* llama_get_embeddings(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

Single*
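
A hedged sketch of wrapping the returned pointer in a span and copying the data out. nOutputs and nEmbd are hypothetical values the caller knows from the batch and the model, and the pointer is only valid until the context is decoded again.

```csharp
using System;
using LLama.Native;

static unsafe float[] CopyEmbeddings(SafeLLamaContextHandle ctx, int nOutputs, int nEmbd)
{
    float* embeddings = NativeApi.llama_get_embeddings(ctx);
    if (embeddings == null)
        return Array.Empty<float>();

    // The embeddings are stored contiguously: one nEmbd-sized row per output token.
    var span = new ReadOnlySpan<float>(embeddings, nOutputs * nEmbd);
    return span.ToArray();   // copy out before the context is decoded again
}
```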

llama_chat_apply_template(Byte*, LLamaChatMessage*, UIntPtr, Boolean, Byte*, Int32)

Apply chat template. Inspired by hf apply_chat_template() in Python.

public static int llama_chat_apply_template(Byte* tmpl, LLamaChatMessage* chat, UIntPtr n_msg, bool add_ass, Byte* buf, int length)

Parameters

tmpl Byte*
A Jinja template to use for this chat. If this is nullptr, the model’s default chat template will be used instead.

chat LLamaChatMessage*
Pointer to a list of multiple llama_chat_message

n_msg UIntPtr
Number of llama_chat_message in this chat

add_ass Boolean
Whether to end the prompt with the token(s) that indicate the start of an assistant message.

buf Byte*
A buffer to hold the output formatted prompt. The recommended alloc size is 2 * (total number of characters of all messages)

length Int32
The size of the allocated buffer

Returns

Int32
The total number of bytes of the formatted prompt. If it is larger than the size of the buffer, you may need to re-alloc it and then re-apply the template.
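
A hedged sketch of applying the model's default chat template (tmpl == null, as described above) and growing the buffer when the return value exceeds its size. Treating LLamaChatMessage's role and content fields as native UTF-8 pointers, and those field names themselves, are assumptions; check the struct definition before relying on this.

```csharp
using System;
using System.Text;
using LLama.Native;

static unsafe string ApplyDefaultTemplate()
{
    // Null-terminated UTF-8 strings for a single user message (illustrative content).
    byte[] role = Encoding.UTF8.GetBytes("user\0");
    byte[] content = Encoding.UTF8.GetBytes("Hello!\0");

    fixed (byte* rolePtr = role)
    fixed (byte* contentPtr = content)
    {
        var message = new LLamaChatMessage { role = rolePtr, content = contentPtr };

        var buffer = new byte[256];
        int needed;
        while (true)
        {
            fixed (byte* bufPtr = buffer)
            {
                // tmpl == null selects the model's built-in chat template;
                // true asks for the assistant prefix to be appended (add_ass).
                needed = NativeApi.llama_chat_apply_template(
                    null, &message, (UIntPtr)1u, true, bufPtr, buffer.Length);
            }

            if (needed < 0)
                throw new InvalidOperationException("llama_chat_apply_template failed");
            if (needed <= buffer.Length)
                break;

            // Too small: the return value is the required size, so grow and retry.
            buffer = new byte[needed];
        }

        return Encoding.UTF8.GetString(buffer, 0, needed);
    }
}
```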

llama_chat_builtin_templates(Char**, UIntPtr)

Get list of built-in chat templates

public static int llama_chat_builtin_templates(Char** output, UIntPtr len)

Parameters

output Char**

len UIntPtr

Returns

Int32

llama_print_timings(SafeLLamaContextHandle)

Print out timing information for this context

public static void llama_print_timings(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

llama_print_system_info()

Print system information

public static IntPtr llama_print_system_info()

Returns

IntPtr

llama_token_to_piece(Vocabulary, LLamaToken, Span<Byte>, Int32, Boolean)

Convert a single token into text

public static int llama_token_to_piece(Vocabulary vocab, LLamaToken llamaToken, Span<byte> buffer, int lstrip, bool special)

Parameters

vocab Vocabulary

llamaToken LLamaToken

buffer Span<Byte>
buffer to write string into

lstrip Int32
User can skip up to 'lstrip' leading spaces before copying (useful when encoding/decoding multiple tokens with 'add_space_prefix')

special Boolean
If true, special tokens are rendered in the output

Returns

Int32
The length written, or, if the buffer is too small, a negative value indicating the length required
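
A hedged sketch of the negative-length convention described above: when the buffer is too small, the magnitude of the negative return value is the size to retry with. The Vocabulary instance is assumed to come from the loaded model.

```csharp
using System;
using System.Text;
using LLama.Native;

static string TokenToText(Vocabulary vocab, LLamaToken token)
{
    Span<byte> buffer = stackalloc byte[64];
    int written = NativeApi.llama_token_to_piece(vocab, token, buffer, lstrip: 0, special: false);

    if (written < 0)
    {
        // Buffer too small: -written is the required length, so retry with that size.
        buffer = new byte[-written];
        written = NativeApi.llama_token_to_piece(vocab, token, buffer, lstrip: 0, special: false);
    }

    return Encoding.UTF8.GetString(buffer.Slice(0, written));
}
```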

llama_log_set(LLamaLogCallback)

Caution

Use NativeLogConfig.llama_log_set instead


Register a callback to receive llama log messages

public static void llama_log_set(LLamaLogCallback logCallback)

Parameters

logCallback LLamaLogCallback

llama_kv_self_seq_rm(SafeLLamaContextHandle, LLamaSeqId, LLamaPos, LLamaPos)

Removes all tokens that belong to the specified sequence and have positions in [p0, p1)

public static bool llama_kv_self_seq_rm(SafeLLamaContextHandle ctx, LLamaSeqId seq, LLamaPos p0, LLamaPos p1)

Parameters

ctx SafeLLamaContextHandle

seq LLamaSeqId

p0 LLamaPos

p1 LLamaPos

Returns

Boolean
Returns false if a partial sequence cannot be removed. Removing a whole sequence never fails
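
A short sketch of trimming part of a sequence's KV cache; the caller is assumed to already hold the sequence id and the [start, end) positions as LLamaSeqId/LLamaPos values.

```csharp
using System;
using LLama.Native;

static void TrimSequence(SafeLLamaContextHandle ctx, LLamaSeqId seq, LLamaPos start, LLamaPos end)
{
    // Remove every cached token of `seq` whose position lies in [start, end).
    bool removed = NativeApi.llama_kv_self_seq_rm(ctx, seq, start, end);

    // Removing a whole sequence never fails, but a partial removal can,
    // so check the result when trimming a range.
    if (!removed)
        Console.WriteLine("Partial sequence removal was not possible");
}
```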

llama_batch_init(Int32, Int32, Int32)

Allocates a batch of tokens on the heap. Each token can be assigned up to n_seq_max sequence ids. The batch has to be freed with llama_batch_free(). If embd != 0, llama_batch.embd will be allocated with size n_tokens * embd * sizeof(float); otherwise, llama_batch.token will be allocated to store n_tokens llama_token. The rest of the llama_batch members are allocated with size n_tokens. All members are left uninitialized.

public static LLamaNativeBatch llama_batch_init(int n_tokens, int embd, int n_seq_max)

Parameters

n_tokens Int32

embd Int32

n_seq_max Int32
Each token can be assigned up to n_seq_max sequence ids

Returns

LLamaNativeBatch

llama_batch_free(LLamaNativeBatch)

Frees a batch of tokens allocated with llama_batch_init()

public static void llama_batch_free(LLamaNativeBatch batch)

Parameters

batch LLamaNativeBatch
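
A short sketch of the allocate/use/free pattern; the batch members are left uninitialized by llama_batch_init and must be filled in before decoding, and the sizes shown are illustrative.

```csharp
using LLama.Native;

static void UseBatch()
{
    // Room for 512 tokens, token mode (embd == 0), at most one sequence id per token.
    LLamaNativeBatch batch = NativeApi.llama_batch_init(n_tokens: 512, embd: 0, n_seq_max: 1);
    try
    {
        // ... fill in the batch's token / position / sequence-id / logits members, then decode ...
    }
    finally
    {
        // Every batch from llama_batch_init must be released with llama_batch_free.
        NativeApi.llama_batch_free(batch);
    }
}
```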

llama_apply_adapter_cvec(SafeLLamaContextHandle, Single*, UIntPtr, Int32, Int32, Int32)

Apply a loaded control vector to a llama_context, or if data is NULL, clear the currently loaded vector. n_embd should be the size of a single layer's control, and data should point to an n_embd x n_layers buffer starting from layer 1. il_start and il_end are the layer range the vector should apply to (both inclusive). See llama_control_vector_load in common to load a control vector.

public static int llama_apply_adapter_cvec(SafeLLamaContextHandle ctx, Single* data, UIntPtr len, int n_embd, int il_start, int il_end)

Parameters

ctx SafeLLamaContextHandle

data Single*

len UIntPtr

n_embd Int32

il_start Int32

il_end Int32

Returns

Int32

llama_split_path(String, UIntPtr, String, Int32, Int32)

Build a split GGUF final path for this chunk. llama_split_path(split_path, sizeof(split_path), "/models/ggml-model-q4_0", 2, 4) => split_path = "/models/ggml-model-q4_0-00002-of-00004.gguf"

public static int llama_split_path(string split_path, UIntPtr maxlen, string path_prefix, int split_no, int split_count)

Parameters

split_path String

maxlen UIntPtr

path_prefix String

split_no Int32

split_count Int32

Returns

Int32
Returns the split_path length.

llama_split_prefix(String, UIntPtr, String, Int32, Int32)

Extract the path prefix from the split_path if and only if the split_no and split_count match. llama_split_prefix(split_prefix, 64, "/models/ggml-model-q4_0-00002-of-00004.gguf", 2, 4) => split_prefix = "/models/ggml-model-q4_0"

public static int llama_split_prefix(string split_prefix, UIntPtr maxlen, string split_path, int split_no, int split_count)

Parameters

split_prefix String

maxlen UIntPtr

split_path String

split_no Int32

split_count Int32

Returns

Int32
Returns the split_prefix length.

ggml_backend_dev_count()

Get the number of available backend devices

public static UIntPtr ggml_backend_dev_count()

Returns

UIntPtr
Count of available backend devices

ggml_backend_dev_get(UIntPtr)

Get a backend device by index

public static IntPtr ggml_backend_dev_get(UIntPtr i)

Parameters

i UIntPtr
Device index

Returns

IntPtr
Pointer to the backend device

ggml_backend_dev_buffer_type(IntPtr)

Get the buffer type for a backend device

public static IntPtr ggml_backend_dev_buffer_type(IntPtr dev)

Parameters

dev IntPtr
Backend device pointer

Returns

IntPtr
Pointer to the buffer type

ggml_backend_buft_name(IntPtr)

Get the name of a buffer type

public static IntPtr ggml_backend_buft_name(IntPtr buft)

Parameters

buft IntPtr
Buffer type pointer

Returns

IntPtr
Name of the buffer type
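
A hedged sketch that enumerates the backend devices and prints the name of each device's buffer type; it assumes the returned name is a null-terminated string owned by the backend (so it is read, not freed).

```csharp
using System;
using System.Runtime.InteropServices;
using LLama.Native;

static void ListBackendDevices()
{
    UIntPtr count = NativeApi.ggml_backend_dev_count();
    for (uint i = 0; i < (ulong)count; i++)
    {
        IntPtr device = NativeApi.ggml_backend_dev_get((UIntPtr)i);
        IntPtr bufferType = NativeApi.ggml_backend_dev_buffer_type(device);
        IntPtr namePtr = NativeApi.ggml_backend_buft_name(bufferType);

        // The name is owned by the backend; copy it into a managed string.
        string name = Marshal.PtrToStringAnsi(namePtr) ?? "(unknown)";
        Console.WriteLine($"Device {i}: buffer type = {name}");
    }
}
```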

llava_validate_embed_size(SafeLLamaContextHandle, SafeLlavaModelHandle)

Sanity check for clip <-> llava embed size match

public static bool llava_validate_embed_size(SafeLLamaContextHandle ctxLlama, SafeLlavaModelHandle ctxClip)

Parameters

ctxLlama SafeLLamaContextHandle
LLama Context

ctxClip SafeLlavaModelHandle
Llava Model

Returns

Boolean
True if validation succeeds

llava_image_embed_make_with_bytes(SafeLlavaModelHandle, Int32, Byte[], Int32)

Build an image embed from image file bytes

public static SafeLlavaImageEmbedHandle llava_image_embed_make_with_bytes(SafeLlavaModelHandle ctx_clip, int n_threads, Byte[] image_bytes, int image_bytes_length)

Parameters

ctx_clip SafeLlavaModelHandle
SafeHandle to the Clip Model

n_threads Int32
Number of threads

image_bytes Byte[]
Binary image in jpeg format

image_bytes_length Int32
Bytes length of the image

Returns

SafeLlavaImageEmbedHandle
SafeHandle to the Embeddings

llava_image_embed_make_with_filename(SafeLlavaModelHandle, Int32, String)

Build an image embed from a path to an image filename

public static SafeLlavaImageEmbedHandle llava_image_embed_make_with_filename(SafeLlavaModelHandle ctx_clip, int n_threads, string image_path)

Parameters

ctx_clip SafeLlavaModelHandle
SafeHandle to the Clip Model

n_threads Int32
Number of threads

image_path String
Path to the image file (jpeg) to generate embeddings from

Returns

SafeLlavaImageEmbedHandle
SafeHandle to the embeddings

llava_image_embed_free(IntPtr)

Free an embedding made with llava_image_embed_make_*

public static void llava_image_embed_free(IntPtr embed)

Parameters

embed IntPtr
Embeddings to release

llava_eval_image_embed(SafeLLamaContextHandle, SafeLlavaImageEmbedHandle, Int32, Int32&)

Write the image represented by embed into the llama context with batch size n_batch, starting at context pos n_past. On completion, n_past points to the next position in the context after the image embed.

public static bool llava_eval_image_embed(SafeLLamaContextHandle ctx_llama, SafeLlavaImageEmbedHandle embed, int n_batch, Int32& n_past)

Parameters

ctx_llama SafeLLamaContextHandle
Llama Context

embed SafeLlavaImageEmbedHandle
Embedding handle

n_batch Int32

n_past Int32&

Returns

Boolean
True on success
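
A hedged sketch that ties the llava helpers together: validate that the clip and llama embedding sizes match, build an embedding from an image file, and write it into the context. ctxLlama and ctxClip are assumed to already exist, the thread count is illustrative, and since the embedding is returned as a SafeHandle it is released here by disposal rather than by calling llava_image_embed_free directly.

```csharp
using LLama.Native;

static bool EvalImage(SafeLLamaContextHandle ctxLlama, SafeLlavaModelHandle ctxClip,
                      string imagePath, int nBatch, ref int nPast)
{
    // Make sure the clip model and the llama context agree on embedding size.
    if (!NativeApi.llava_validate_embed_size(ctxLlama, ctxClip))
        return false;

    // Build the image embedding from a jpeg file on disk.
    using SafeLlavaImageEmbedHandle embed =
        NativeApi.llava_image_embed_make_with_filename(ctxClip, 4 /* n_threads */, imagePath);

    // Write the image into the context; on success n_past is advanced to the
    // first position after the image embedding.
    return NativeApi.llava_eval_image_embed(ctxLlama, embed, nBatch, ref nPast);
}
```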

GetLoadedNativeLibrary(NativeLibraryName)

Get the loaded native library. If you are using netstandard2.0, it will always return null.

public static INativeLibrary GetLoadedNativeLibrary(NativeLibraryName name)

Parameters

name NativeLibraryName

Returns

INativeLibrary

Exceptions

ArgumentException

llama_model_quantize(String, String, LLamaModelQuantizeParams&)

Returns 0 on success

public static uint llama_model_quantize(string fname_inp, string fname_out, LLamaModelQuantizeParams& param)

Parameters

fname_inp String

fname_out String

param LLamaModelQuantizeParams&

Returns

UInt32
Returns 0 on success
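
A hedged sketch of quantizing a model file. The fields of LLamaModelQuantizeParams (target format, thread count, and so on) are not documented on this page, so they are left at their defaults here and should be configured per the struct definition before real use; the file names are illustrative.

```csharp
using LLama.Native;

static bool QuantizeModel(string inputPath, string outputPath)
{
    // NOTE: configure the quantization parameters according to the
    // LLamaModelQuantizeParams definition; the defaults used here are only
    // placeholders for the sketch.
    var quantizeParams = new LLamaModelQuantizeParams();

    uint result = NativeApi.llama_model_quantize(inputPath, outputPath, ref quantizeParams);
    return result == 0;   // zero indicates success
}
```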

