Namespace NumSharp.Backends.Kernels

Classes

DirectILKernelGenerator

Binary operations (same-type) - contiguous kernels and generic helpers.

GeneratedDelegates

Central registry for every runtime-generated kernel cache. Each cache exposes a public read-only *Count (live cache size) and an internal reset.

ILKernelGenerator

Generates per-chunk IL kernels for NDIter-driven execution.

Kernels emitted here are called as the inner loop of an NDIter iteration — once per chunk, with dataptrs/strides/count provided by the iterator. The kernel does no axis or stride walking of its own.

Add new kernel families in ILKernelGenerator.<Op>.cs partial files. See DirectILKernelGenerator for the legacy whole-array kernels currently being migrated to this model.
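The per-chunk calling convention described above can be sketched as follows. This is a minimal illustration, not NumSharp's generated IL: the name add_chunk_kernel, the use of a bytearray in place of raw memory, and the assumption that the kernel advances each pointer by the iterator-supplied inner stride are all hypothetical.

```python
import struct

def add_chunk_kernel(memory, dataptrs, strides, count):
    """Hypothetical inner loop for one chunk: out[i] = a[i] + b[i] (float64).

    memory   -- bytearray standing in for raw memory
    dataptrs -- byte offsets [a, b, out] for this chunk, set by the iterator
    strides  -- byte strides [a, b, out] for this chunk, set by the iterator
    count    -- number of elements in this chunk
    """
    pa, pb, pout = dataptrs
    sa, sb, sout = strides
    for _ in range(count):
        x = struct.unpack_from("<d", memory, pa)[0]
        y = struct.unpack_from("<d", memory, pb)[0]
        struct.pack_into("<d", memory, pout, x + y)
        # The kernel only steps by the strides it was handed; choosing
        # axes and chunk boundaries stays with the iterator.
        pa += sa
        pb += sb
        pout += sout

# One chunk of three elements: a at byte 0, b at byte 24, out at byte 48.
memory = bytearray(9 * 8)
struct.pack_into("<3d", memory, 0, 1.0, 2.0, 3.0)
struct.pack_into("<3d", memory, 24, 10.0, 20.0, 30.0)
add_chunk_kernel(memory, dataptrs=[0, 24, 48], strides=[8, 8, 8], count=3)
chunk_result = list(struct.unpack_from("<3d", memory, 48))
```

In this sketch a stride of 0 for an operand would re-read the same element on every step, which is how a broadcast operand could be expressed without changing the kernel.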

ReductionOpExtensions

Extension methods for ReductionOp.

ReductionTypeExtensions

Extension methods for NPTypeCode related to reductions.

SimdDot

SIMD fused multiply-accumulate dot product for contiguous float / double vectors. Computes sum(a[i] * b[i]) in a single pass — no temporary product array (contrast with left * right followed by ReduceAdd, which materializes an n-element temp and walks the data twice).

Four independent Vector256 accumulators give the out-of-order core enough instruction-level parallelism to hide FMA latency; a scalar tail handles the remainder. Accumulation type matches the element type (double in double, float in float) so the result dtype mirrors NumPy's np.dot.

Callers route only contiguous (stride == 1) same-type operands here; strided views take a scalar strided loop, and non-float dtypes take the INumber<T> path — both in Default.Dot.Fused.cs.
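The accumulator layout described above can be illustrated in scalar form. The sketch below models four independent vector accumulators plus a scalar tail; the function name and the lanes parameter are hypothetical, and plain Python lists stand in for Vector256 registers, so it shows the structure of the loop rather than its performance.

```python
def dot_four_accumulators(a, b, lanes=4):
    """Single-pass dot product with four independent accumulators.

    lanes models the element count of one SIMD vector (for example 4 doubles
    in a 256-bit register). No temporary product array is built.
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    n = len(a)
    step = 4 * lanes                      # elements consumed per unrolled iteration
    acc = [[0.0] * lanes for _ in range(4)]
    i = 0
    while i + step <= n:
        for k in range(4):                # four independent dependency chains
            base = i + k * lanes
            for lane in range(lanes):
                acc[k][lane] += a[base + lane] * b[base + lane]
        i += step
    total = sum(sum(vec) for vec in acc)  # horizontal reduction of the accumulators
    for j in range(i, n):                 # scalar tail for the remainder
        total += a[j] * b[j]
    return total

dot_result = dot_four_accumulators([float(x) for x in range(1, 20)], [2.0] * 19)
```

With 19 elements and lanes=4, the unrolled loop covers the first 16 elements and the tail handles the last 3.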

SimdMatMul

High-performance SIMD matrix multiplication with cache blocking and panel packing. Single-threaded implementation achieving ~20 GFLOPS on modern CPUs.

Key optimizations:

  • GEBP (General Block Panel) algorithm with cache blocking
  • Full panel packing: A as [kc][MR] panels, B as [kc][NR] panels
  • 8x16 micro-kernel with 16 Vector256 accumulators
  • FMA (Fused Multiply-Add) for 2x FLOP throughput
  • 4x k-loop unrolling for instruction-level parallelism

Stride-aware variants (see SimdMatMul.Strided.cs / SimdMatMul.Double.cs) accept (stride0, stride1) for each operand so transposed / sliced NDArray views can be matmul'd without materializing contiguous copies.
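To make the cache-blocking and stride-aware ideas concrete, here is a deliberately simplified sketch. It tiles the three loops and addresses each operand through (stride0, stride1), but it omits panel packing, the micro-kernel, and FMA. The function name, the block parameter, and the choice to express strides in elements rather than bytes are assumptions made for illustration.

```python
def matmul_blocked(a, a_strides, b, b_strides, m, k, n, block=2):
    """C = A @ B over flat lists; C is returned row-major contiguous.

    a_strides / b_strides are (stride0, stride1) in elements, so a transposed
    view is described by swapping the strides rather than copying the data.
    """
    as0, as1 = a_strides
    bs0, bs1 = b_strides
    c = [0.0] * (m * n)
    for i0 in range(0, m, block):              # tile over rows of A / C
        for p0 in range(0, k, block):          # tile over the shared dimension
            for j0 in range(0, n, block):      # tile over columns of B / C
                for i in range(i0, min(i0 + block, m)):
                    for p in range(p0, min(p0 + block, k)):
                        a_ip = a[i * as0 + p * as1]
                        for j in range(j0, min(j0 + block, n)):
                            c[i * n + j] += a_ip * b[p * bs0 + j * bs1]
    return c

# A is 2x3 row-major. B is 3x2, viewed through a transposed 2x3 buffer.
a_flat = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
bt_flat = [7.0, 9.0, 11.0, 8.0, 10.0, 12.0]   # B[p][j] lives at bt_flat[p * 1 + j * 3]
matmul_result = matmul_blocked(a_flat, (3, 1), bt_flat, (1, 3), m=2, k=3, n=2)
```

The transposed operand is consumed in place through its strides, which is the property the stride-aware variants are described as providing.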

StrideDetector

Stride-based pattern detection for selecting optimal SIMD execution paths. All methods are aggressively inlined for minimal dispatch overhead.

Structs

AxisReductionKernelKey

Cache key for axis reduction kernels. The keyed kernel reduces along a specific axis, producing an array with one fewer dimension.

BinaryScalarKernelKey

Cache key for binary scalar operation kernels. Identifies a unique kernel by LHS type, RHS type, result type, and operation.

ComparisonKernelKey

Cache key for comparison operation kernels. Identifies a unique kernel by LHS type, RHS type, operation, and execution path. Result type is always bool (NPTypeCode.Boolean).

ComparisonScalarKernelKey

Cache key for comparison scalar operation kernels. Identifies a unique kernel by LHS type, RHS type, and comparison operation. Result type is always bool.

CopyKernelKey

CumulativeAxisKernelKey

Cache key for cumulative axis reduction kernels (cumsum along axis, etc.). The output has the same shape as the input, with a running accumulation along the specified axis.

CumulativeKernelKey

Cache key for cumulative reduction kernels (cumsum, etc.). The output has the same shape as the input; each element is the accumulation of all elements up to and including that position.

DirectILKernelGenerator.WeightedSumKernelKey

ElementReductionKernelKey

Cache key for element-wise (full array) reduction kernels. The keyed kernel reduces all elements to a single scalar value.

ILKernelGenerator.ReduceKernelKey

Cache key for a per-chunk reduction kernel: operation + input dtype + accumulator (output) dtype. Layout is handled at runtime inside the kernel body (pinned vs slab, contiguous vs strided inner loop), so it is NOT part of the key.

ILKernelGenerator.ScanAxisAux

Scan-axis geometry passed to a cumulative inner-loop kernel via auxdata. Byte strides + element count for the (removed) scan axis; the iterator supplies the per-operand byte strides for the remaining innermost axis through the kernel's strides argument.

IndexCollector

A growable buffer for collecting long indices, backed by NDArray storage. Replaces LongIndexBuffer - uses NumSharp's existing unmanaged memory infrastructure.

MixedTypeKernelKey

Cache key for mixed-type binary operation kernels. Identifies a unique kernel by LHS type, RHS type, result type, operation, and execution path.

UnaryKernelKey

Cache key for unary operation kernels. Identifies a unique kernel by input type, output type, operation, and whether contiguous.

UnaryScalarKernelKey

Cache key for unary scalar operation kernels. Identifies a unique kernel by input type, output type, and operation.

Enums

BinaryOp

Binary operations supported by kernel providers.

ComparisonOp

Comparison operations supported by kernel providers. All comparison operations return bool (NPTypeCode.Boolean).

CopyExecutionPath

DirectILKernelGenerator.ClipBoundsKind

DirectILKernelGenerator.ClipMode

ExecutionPath

Execution paths for binary operations, selected based on stride analysis.

ReductionOp

Reduction operations supported by kernel providers.

UnaryOp

Unary operations supported by kernel providers.

Delegates

ArgwhereCountKernel

IL-emitted SIMD popcount: returns the count of non-zero elements in a contiguous T-typed buffer. size is in elements (not bytes).

ArgwhereExpandKernel

IL-emitted coord-expand: converts a flat-index buffer (monotonic ascending C-order) to the (count, ndim) row-major argwhere result via incremental coord advance — no per-element divmod. Dtype-agnostic (operates on long*).
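One way to read "incremental coord advance" is an odometer that carries the previous coordinate forward instead of decomposing each flat index from scratch. The sketch below is an assumption about that approach, not the emitted IL; the function name is hypothetical, and it requires ndim >= 1 and ascending indices.

```python
def expand_flat_to_coords(flat_indices, shape):
    """Turn ascending C-order flat indices into (count, ndim) coordinate rows.

    Uses only add, subtract, and compare per step: the gap to the previous
    index is added to the innermost coordinate and carried outward.
    """
    ndim = len(shape)
    coords = [0] * ndim
    previous = 0
    rows = []
    for flat in flat_indices:
        coords[-1] += flat - previous      # advance by the gap since the last hit
        previous = flat
        for d in range(ndim - 1, 0, -1):   # carry from the innermost axis outward
            while coords[d] >= shape[d]:
                coords[d] -= shape[d]
                coords[d - 1] += 1
        rows.append(tuple(coords))
    return rows

argwhere_rows = expand_flat_to_coords([1, 3, 5], (2, 3))
```

Because the input is monotonic, the running coordinate never has to move backwards, which is what makes the carry-only update sufficient.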

ArgwhereFlatKernel

IL-emitted SIMD bit-scan: writes flat element indices of non-zero T positions into outBuf (caller must pre-size via ArgwhereCountKernel). Returns the number written.

AxisReductionKernel

Delegate for axis reduction kernels. Reduces along a specific axis, writing to output array.

ComparisonKernel

Comparison operation kernel signature using void pointers. LHS and RHS may have different types, but result is always bool. Type conversion is handled internally by the generated IL.

ContiguousKernel<T>

Delegate for contiguous (SimdFull) binary operations. Simplified signature - no strides needed since both arrays are contiguous.

CopyKernel

CumulativeAxisKernel

Delegate for cumulative axis reduction kernels. Computes running accumulation along a specific axis.

CumulativeKernel

Delegate for cumulative reduction kernels (cumsum, etc.). Output has same shape as input.

DirectILKernelGenerator.CastKernel

Cross-dtype contiguous copy kernel. Both src and dst must be contiguous.

DirectILKernelGenerator.ClipKernel

Universal clip kernel signature: read from src, clamp to lo / hi, write to dst. size is element count. Bound pointers are interpreted per the mode the kernel was generated for (scalar = single value, array = size values). Bound pointers for unused sides are ignored by the generated IL.
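The scalar-versus-array bound handling can be sketched as below. This is an illustrative model only: the function name is hypothetical, None stands in for an unused side, and a Python list stands in for an array bound of size values. The clamp order follows NumPy's min(max(x, lo), hi).

```python
def clip(src, lo=None, hi=None):
    """Clamp each element of src to [lo, hi]; either bound may be a scalar,
    a per-element list of the same length as src, or None (side unused)."""
    dst = [None] * len(src)
    for i, value in enumerate(src):
        if lo is not None:
            lower = lo[i] if isinstance(lo, list) else lo
            value = max(value, lower)
        if hi is not None:
            upper = hi[i] if isinstance(hi, list) else hi
            value = min(value, upper)
        dst[i] = value
    return dst

clip_mixed = clip([1, 5, 9], lo=3, hi=[4, 6, 8])   # scalar lower, array upper
clip_upper_only = clip([1, 5, 9], hi=4)            # lower side unused
```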

DirectILKernelGenerator.InnerCastLoop

Convert count elements src→dst, advancing each pointer by its byte stride per element. The body is one direct call to Converts.To{Dst}.

DirectILKernelGenerator.MaskedCastKernel

where-masked cross-dtype copy kernel delegate.

DirectILKernelGenerator.QuantileKernel

DirectILKernelGenerator.RepeatBroadcastKernel

Broadcast variant — every j uses the same count.

DirectILKernelGenerator.RepeatPerJKernel

Per-j variant — counts[j] varies; must have length n.

DirectILKernelGenerator.SearchSortedKernel

SearchSorted kernel delegate.

DirectILKernelGenerator.ShiftArrayKernel<T>

Delegate for shift operation with per-element shift amounts. This is the scalar loop path for element-wise shifts.

DirectILKernelGenerator.ShiftScalarKernel<T>

Delegate for shift operation with scalar shift amount. This is the SIMD-optimized path for uniform shifts.

DirectILKernelGenerator.StridedCastKernel

Cross-dtype strided/broadcast copy kernel.

ElementReductionKernel

Delegate for element-wise reduction kernels. Reduces all elements of an array to a single value.

FilterAxisKernel

IL-emitted fused mask-driven gather. Reads src at each True position in mask and emits one slab per outer block into dst. The runtime innerSize is honoured by the "bulk" variant; the typed variants (1/2/4/8/16-byte) ignore it (the size is baked into IL).

IndicesKernel

IL-emitted slab filler for np.indices. Writes the multi-axis coordinate of each output position directly via blockwise SIMD memsets; no per-element divmod or coord advance.

MatMul2DKernel<T>

Kernel delegate for 2D matrix multiplication: C = A * B, where A is [M x K], B is [K x N], and C is [M x N]. All matrices are row-major contiguous.

MixedTypeKernel

Mixed-type binary operation kernel signature using void pointers. Handles operations where LHS, RHS, and result may have different types. Type conversion is handled internally by the generated IL.

NonZeroPerDimKernel

IL-emitted coord expand for np.nonzero: converts a flat-index buffer (monotonic ascending C-order) into ndim separate per-dim coordinate columns via incremental coord advance — no per-element divmod. Dtype-agnostic (operates on long*).

outCols is a pointer to an array of ndim long* pointers, one per output dimension. outCols[d][i] receives the coordinate along dim d for the i'th non-zero element.

PlaceKernel

IL-emitted mask-driven scatter for np.place. For each i in [0, maskSize) with mask[i] true, writes dst[i] = values[j % valuesCount] and advances j.
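The scatter rule above translates directly into a scalar loop. The sketch below follows the stated formula; the function name and the use of Python lists are illustrative assumptions.

```python
def place(dst, mask, values):
    """For each True position in mask, write the next value, cycling through values.

    Returns the number of elements written (the final value of j).
    """
    if not values:
        raise ValueError("values must be non-empty")
    j = 0
    for i in range(len(mask)):
        if mask[i]:
            dst[i] = values[j % len(values)]
            j += 1
    return j

place_dst = [0, 0, 0, 0, 0]
place_written = place(place_dst, [True, False, True, True, True], [7, 8])
```

Note that j advances only on True positions, so the values cycle independently of i.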

PutKernel

IL-emitted scatter kernel for np.put. Writes dst[apply_mode(indices[i])] = values[i % valuesCount] for each i in [0, indicesCount).
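A scalar model of this rule is sketched below. The mode names "raise", "wrap", and "clip" are the ones NumPy's np.put accepts; how the kernel itself encodes the mode is not stated here, so the string argument and both function names are assumptions.

```python
def apply_mode(index, size, mode):
    """Map a possibly out-of-range index into [0, size) according to mode."""
    if mode == "wrap":
        return index % size
    if mode == "clip":
        return min(max(index, 0), size - 1)
    if mode == "raise":
        if not -size <= index < size:
            raise IndexError(f"index {index} is out of bounds for size {size}")
        return index + size if index < 0 else index
    raise ValueError(f"unknown mode: {mode!r}")

def put(dst, indices, values, mode="raise"):
    """dst[apply_mode(indices[i])] = values[i % len(values)] for each i."""
    for i, index in enumerate(indices):
        dst[apply_mode(index, len(dst), mode)] = values[i % len(values)]

put_wrapped = [0] * 5
put(put_wrapped, [0, 6, -1], [10, 20], mode="wrap")
put_clipped = [0] * 5
put(put_clipped, [7, -3], [1, 2], mode="clip")
```

Unlike the place rule, the values index here follows i directly, so values cycle in step with the indices array.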

RavelMultiIndexKernel

IL-emitted multi-coord→flat-index folder for np.ravel_multi_index. Caller pre-casts each coord array to contiguous int64 and computes ravelStrides (C or F order baked into the strides); the kernel reads coords[d][i] linearly, applies the per-axis modes (clipping/wrapping), and writes the summed flat index into outIndices.
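The fold can be sketched as a per-element sum of coordinate times stride. In the sketch below the helper c_order_strides, the string mode names, and the list-based layout are illustrative assumptions; only the overall flow (caller builds strides, kernel applies per-axis modes and sums) comes from the description above.

```python
def c_order_strides(dims):
    """Element strides for C order: the last axis varies fastest."""
    strides = [1] * len(dims)
    for d in range(len(dims) - 2, -1, -1):
        strides[d] = strides[d + 1] * dims[d + 1]
    return strides

def ravel_multi_index(coords, dims, ravel_strides, modes):
    """coords[d][i] is the coordinate along axis d of the i-th point."""
    count = len(coords[0])
    out_indices = [0] * count
    for i in range(count):
        flat = 0
        for d, dim in enumerate(dims):
            c = coords[d][i]
            if modes[d] == "wrap":
                c %= dim
            elif modes[d] == "clip":
                c = min(max(c, 0), dim - 1)
            elif not 0 <= c < dim:          # "raise": reject out-of-range coords
                raise ValueError(f"coordinate {c} out of range for axis {d}")
            flat += c * ravel_strides[d]
        out_indices[i] = flat
    return out_indices

dims = (3, 4)
ravel_result = ravel_multi_index(
    [[0, 1, 2], [1, 2, 3]], dims, c_order_strides(dims), ["raise", "raise"]
)
```

Passing F-order strides (for dims (3, 4) that would be [1, 3]) changes the result without any change to the loop, which matches the idea of baking the order into the strides.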

StridedUnaryKernel

Fused strided-source unary kernel: builds each SIMD vector directly from a strided 1-D source via lane-count scalar gathers, applies the unary op, and stores contiguously — single pass, no scratch buffer, no per-tile dispatch.
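The gather-then-apply structure can be modelled in scalar code as follows. This is a sketch under assumptions: the function name and the lanes parameter are hypothetical, a Python list stands in for a SIMD vector, and the stride is given in elements.

```python
def strided_unary(src, start, stride, count, op, lanes=4):
    """Apply op to count elements read from src[start + i * stride], writing
    the results contiguously in a single pass."""
    dst = [None] * count
    i = 0
    while i + lanes <= count:
        # Build one "vector" from lanes scalar gathers at strided positions.
        vector = [src[start + (i + lane) * stride] for lane in range(lanes)]
        # Apply the op to the whole vector and store it contiguously.
        dst[i:i + lanes] = [op(x) for x in vector]
        i += lanes
    for j in range(i, count):              # scalar tail
        dst[j] = op(src[start + j * stride])
    return dst

strided_result = strided_unary(list(range(20)), start=1, stride=3, count=6,
                               op=lambda x: -x)
```

No intermediate contiguous copy of the source is made; each gathered vector is consumed immediately.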

TakeKernel

IL-emitted gather kernel for np.take. The source is treated as a 3-D layout (outerSize, maxItem, innerSize-bytes). For each (outer, j) pair the kernel reads indices[j], applies mode, and copies innerSize bytes from the source slab to the destination position.
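The 3-D view of the source can be sketched with byte buffers. The function names and the string mode argument below are assumptions for illustration; the offsets follow from treating the source as (outerSize, maxItem, innerSize bytes) and the destination as (outerSize, len(indices), innerSize bytes).

```python
def _resolve_index(index, size, mode):
    """Map index into [0, size) for the given mode (NumPy-style names)."""
    if mode == "wrap":
        return index % size
    if mode == "clip":
        return min(max(index, 0), size - 1)
    if not -size <= index < size:
        raise IndexError(f"index {index} is out of bounds for size {size}")
    return index % size                    # folds valid negative indices

def take(src, indices, outer_size, max_item, inner_size, mode="raise"):
    """Gather inner_size-byte slabs from src for every (outer, j) pair."""
    n = len(indices)
    dst = bytearray(outer_size * n * inner_size)
    for outer in range(outer_size):
        for j, index in enumerate(indices):
            item = _resolve_index(index, max_item, mode)
            s = (outer * max_item + item) * inner_size
            d = (outer * n + j) * inner_size
            dst[d:d + inner_size] = src[s:s + inner_size]
    return dst

# 2 outer blocks, 3 items per block, 2 bytes per item.
take_result = take(bytes(range(12)), [2, 0], outer_size=2, max_item=3, inner_size=2)
```

Working in bytes keeps the copy independent of the element dtype: only innerSize matters to the loop.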

TraceKernel

Walks the diagonal of one or more 2-D sub-arrays and accumulates each into the promoted dtype. Per-src dtype kernel; accum / result dtypes are baked into the IL at emit time.

TypedElementReductionKernel<TResult>

Delegate for typed element-wise reduction kernels. Returns the reduced value directly without boxing.

UnaryKernel

Unary operation kernel signature using void pointers. Handles operations where input and output may have different types. Type conversion is handled internally by the generated IL.

UnravelIndexKernel

IL-emitted flat→multi-coord expander for np.unravel_index. Caller pre-casts indices to int64 and computes unravelSize = product of dims; the kernel reads indices linearly, validates each value against [0, unravelSize), and emits per-axis coords into outCols using the C-order or F-order extraction direction selected by idxStart / idxStep.
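A plausible reading of idxStart / idxStep is that they choose which axis is peeled off first by the repeated modulo and divide. The sketch below assumes C order starts at the last axis and steps by -1, while F order starts at axis 0 and steps by +1; the function name and the order argument are hypothetical.

```python
def unravel_index(indices, dims, order="C"):
    """Expand flat indices into per-axis coordinate columns out_cols[d][i]."""
    ndim = len(dims)
    unravel_size = 1
    for dim in dims:
        unravel_size *= dim
    # Assumed mapping of order to extraction direction.
    idx_start, idx_step = (ndim - 1, -1) if order == "C" else (0, 1)
    out_cols = [[0] * len(indices) for _ in range(ndim)]
    for i, value in enumerate(indices):
        if not 0 <= value < unravel_size:
            raise ValueError(f"index {value} is out of bounds for size {unravel_size}")
        d = idx_start
        for _ in range(ndim):
            out_cols[d][i] = value % dims[d]   # coordinate along the fastest remaining axis
            value //= dims[d]
            d += idx_step
    return out_cols

unravel_c = unravel_index([1, 6, 11], (3, 4), order="C")
unravel_f = unravel_index([1, 6, 11], (3, 4), order="F")
```

The same loop body serves both orders; only the starting axis and the step direction differ.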

WhereKernel<T>

Delegate for where operation kernels.

WhereScalarXKernel<T>

WhereScalarXYKernel<T>

WhereScalarYKernel<T>