LLamaModelQuantizeParams
Namespace: LLama.Native
Quantizer parameters used in the native API
Inheritance Object → ValueType → LLamaModelQuantizeParams
Remarks:
llama_model_quantize_params
Fields
nthread
number of threads to use for quantizing; if &lt;= 0, std::thread::hardware_concurrency() will be used
ftype
quantize to this llama_ftype
output_tensor_type
output tensor type
token_embedding_type
token embeddings tensor type
imatrix
pointer to importance matrix data
kv_overrides
pointer to vector containing overrides
tensor_types
pointer to vector containing tensor types
Properties
allow_requantize
allow quantizing non-f32/f16 tensors
Property Value
Boolean
quantize_output_tensor
quantize output.weight
Property Value
Boolean
only_copy
only copy tensors; ftype, allow_requantize and quantize_output_tensor are ignored
Property Value
Boolean
pure
quantize all tensors to the default type
Property Value
Boolean
keep_split
quantize to the same number of shards
Property Value
Boolean
Methods
Default()
Create a LLamaModelQuantizeParams with default values