Namespace NumSharp.Backends.Kernels
Classes
- DirectILKernelGenerator
Binary operations (same-type) - contiguous kernels and generic helpers.
- GeneratedDelegates
Central registry for every runtime-generated kernel cache. Each cache exposes a public read-only *Count (live cache size) and an internal reset.
- ILKernelGenerator
Generates per-chunk IL kernels for NDIter-driven execution.
Kernels emitted here are called as the inner loop of an NDIter iteration — once per chunk, with dataptrs/strides/count provided by the iterator. The kernel does no axis or stride walking of its own.
Add new kernel families in ILKernelGenerator.<Op>.cs partial files. See DirectILKernelGenerator for the legacy whole-array kernels currently being migrated to this model.
- ReductionOpExtensions
Extension methods for ReductionOp.
- ReductionTypeExtensions
Extension methods for NPTypeCode related to reductions.
- SimdDot
SIMD fused multiply-accumulate dot product for contiguous float / double vectors. Computes sum(a[i] * b[i]) in a single pass with no temporary product array (contrast with left * right followed by ReduceAdd, which materializes an n-element temp and walks the data twice).
Four independent Vector256 accumulators give the out-of-order core enough instruction-level parallelism to hide FMA latency; a scalar tail handles the remainder. Accumulation type matches the element type (double in double, float in float) so the result dtype mirrors NumPy's np.dot.
Callers route only contiguous (stride == 1) same-type operands here; strided views take a scalar strided loop, and non-float dtypes take the INumber<T> path. Both of those fallbacks live in Default.Dot.Fused.cs.
- SimdMatMul
High-performance SIMD matrix multiplication with cache blocking and panel packing. Single-threaded implementation achieving ~20 GFLOPS on modern CPUs.
Key optimizations:
- GEBP (General Block Panel) algorithm with cache blocking
- Full panel packing: A as [kc][MR] panels, B as [kc][NR] panels
- 8x16 micro-kernel with 16 Vector256 accumulators
- FMA (Fused Multiply-Add) for 2x FLOP throughput
- 4x k-loop unrolling for instruction-level parallelism
Stride-aware variants (see SimdMatMul.Strided.cs / SimdMatMul.Double.cs) accept (stride0, stride1) for each operand so transposed / sliced NDArray views can be matmul'd without materializing contiguous copies.
- StrideDetector
Stride-based pattern detection for selecting optimal SIMD execution paths. All methods are aggressively inlined for minimal dispatch overhead.
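The accumulation scheme described for SimdDot above (one pass, four independent accumulators, scalar tail) can be illustrated with a small reference sketch. This is not the NumSharp implementation: the function name dot_four_accumulators is hypothetical, and each scalar accumulator here merely stands in for one Vector256 register, so only the loop structure carries over.

```python
def dot_four_accumulators(a, b):
    """Single-pass dot product with four independent running sums.

    Illustrative only: in a SIMD kernel each accumulator would be a
    vector register; here each one is a plain float.
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same length")

    n = len(a)
    main = n - (n % 4)          # elements covered by the unrolled loop
    acc0 = acc1 = acc2 = acc3 = 0.0

    i = 0
    while i < main:
        # Four updates with no dependency on one another, so a CPU could
        # overlap their multiply-add latencies.
        acc0 += a[i] * b[i]
        acc1 += a[i + 1] * b[i + 1]
        acc2 += a[i + 2] * b[i + 2]
        acc3 += a[i + 3] * b[i + 3]
        i += 4

    # Combine the partial sums, then finish the remainder one by one.
    total = (acc0 + acc1) + (acc2 + acc3)
    for j in range(main, n):
        total += a[j] * b[j]
    return total
```

No product list is ever built, which is the contrast with multiplying first and reducing afterwards. Note that splitting the sum across accumulators changes the floating-point summation order relative to a strictly sequential loop.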
Structs
- AxisReductionKernelKey
Cache key for axis reduction kernels. Reduces along a specific axis, producing an array with one fewer dimension.
- BinaryScalarKernelKey
Cache key for binary scalar operation kernels. Identifies a unique kernel by LHS type, RHS type, result type, and operation.
- ComparisonKernelKey
Cache key for comparison operation kernels. Identifies a unique kernel by LHS type, RHS type, operation, and execution path. Result type is always bool (NPTypeCode.Boolean).
- ComparisonScalarKernelKey
Cache key for comparison scalar operation kernels. Identifies a unique kernel by LHS type, RHS type, and comparison operation. Result type is always bool.
- CumulativeAxisKernelKey
Cache key for cumulative axis reduction kernels (cumsum along axis, etc.). Output has same shape as input, cumulative accumulation along specified axis.
- CumulativeKernelKey
Cache key for cumulative reduction kernels (cumsum, etc.). Output has the same shape as the input; each element is the accumulation of the elements up to and including it.
- ElementReductionKernelKey
Cache key for element-wise (full array) reduction kernels. Reduces all elements to a single scalar value.
- ILKernelGenerator.ReduceKernelKey
Cache key for a per-chunk reduction kernel: operation + input dtype + accumulator (output) dtype. Layout is handled at runtime inside the kernel body (pinned vs slab, contiguous vs strided inner loop), so it is NOT part of the key.
- ILKernelGenerator.ScanAxisAux
Scan-axis geometry passed to a cumulative inner-loop kernel via auxdata. Byte strides + element count for the (removed) scan axis; the iterator supplies the per-operand byte strides for the remaining innermost axis through the kernel's strides argument.
- IndexCollector
A growable buffer for collecting long indices, backed by NDArray storage. Replaces LongIndexBuffer - uses NumSharp's existing unmanaged memory infrastructure.
- MixedTypeKernelKey
Cache key for mixed-type binary operation kernels. Identifies a unique kernel by LHS type, RHS type, result type, operation, and execution path.
- UnaryKernelKey
Cache key for unary operation kernels. Identifies a unique kernel by input type, output type, operation, and whether contiguous.
- UnaryScalarKernelKey
Cache key for unary scalar operation kernels. Identifies a unique kernel by input type, output type, and operation.
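The key structs above all serve the same purpose: a value-typed tuple that identifies one generated kernel, so that a kernel is emitted once and reused on later lookups. The following is a minimal sketch of that pattern, assuming a get-or-add cache with a live count and a reset (as GeneratedDelegates describes). The names UnaryKernelKeySketch and KernelCache, and the string type codes, are hypothetical and do not reflect the actual NumSharp types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnaryKernelKeySketch:
    """Hypothetical stand-in for a unary kernel key: equality and hashing
    are derived from the four fields, so equal keys share one kernel."""
    input_type: str
    output_type: str
    op: str
    is_contiguous: bool


class KernelCache:
    """Get-or-add cache keyed by a hashable kernel key (illustrative)."""

    def __init__(self, build):
        self._build = build      # callable: key -> generated kernel
        self._kernels = {}

    @property
    def count(self):
        # Live cache size, read-only from the outside.
        return len(self._kernels)

    def get_or_add(self, key):
        kernel = self._kernels.get(key)
        if kernel is None:
            kernel = self._build(key)   # expensive step happens once per key
            self._kernels[key] = kernel
        return kernel

    def reset(self):
        self._kernels.clear()
```

Anything that changes the generated code belongs in the key; anything handled at runtime inside the kernel body does not, which is why ILKernelGenerator.ReduceKernelKey omits layout.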
Enums
- BinaryOp
Binary operations supported by kernel providers.
- ComparisonOp
Comparison operations supported by kernel providers. All comparison operations return bool (NPTypeCode.Boolean).
- ExecutionPath
Execution paths for binary operations, selected based on stride analysis.
- ReductionOp
Reduction operations supported by kernel providers.
- UnaryOp
Unary operations supported by kernel providers.
Delegates
- ArgwhereCountKernel
IL-emitted SIMD popcount: returns the count of non-zero elements in a contiguous T-typed buffer. size is in elements (not bytes).
- ArgwhereExpandKernel
IL-emitted coord-expand: converts a flat-index buffer (monotonic ascending C-order) to the (count, ndim) row-major argwhere result via incremental coord advance, with no per-element divmod. Dtype-agnostic (operates on long*).
- ArgwhereFlatKernel
IL-emitted SIMD bit-scan: writes flat element indices of non-zero T positions into outBuf (caller must pre-size via ArgwhereCountKernel). Returns the number written.
- AxisReductionKernel
Delegate for axis reduction kernels. Reduces along a specific axis, writing to output array.
- ComparisonKernel
Comparison operation kernel signature using void pointers. LHS and RHS may have different types, but result is always bool. Type conversion is handled internally by the generated IL.
- ContiguousKernel<T>
Delegate for contiguous (SimdFull) binary operations. Simplified signature - no strides needed since both arrays are contiguous.
- CumulativeAxisKernel
Delegate for cumulative axis reduction kernels. Computes running accumulation along a specific axis.
- CumulativeKernel
Delegate for cumulative reduction kernels (cumsum, etc.). Output has same shape as input.
- DirectILKernelGenerator.CastKernel
Cross-dtype contiguous copy kernel. Both src and dst must be contiguous.
- DirectILKernelGenerator.ClipKernel
Universal clip kernel signature: read from src, clamp to lo/hi, write to dst. size is element count. Bound pointers are interpreted per the mode the kernel was generated for (scalar = single value, array = size values). Bound pointers for unused sides are ignored by the generated IL.
- DirectILKernelGenerator.InnerCastLoop
Convert count elements src→dst, advancing each pointer by its byte stride per element. The loop body is a single direct call to Converts.To{Dst}.
- DirectILKernelGenerator.MaskedCastKernel
where-masked cross-dtype copy kernel delegate.
- DirectILKernelGenerator.RepeatBroadcastKernel
Broadcast variant: every j uses the same count.
- DirectILKernelGenerator.RepeatPerJKernel
Per-j variant: counts[j] varies; counts must have length n.
- DirectILKernelGenerator.SearchSortedKernel
SearchSorted kernel delegate.
- DirectILKernelGenerator.ShiftArrayKernel<T>
Delegate for shift operation with per-element shift amounts. This is the scalar loop path for element-wise shifts.
- DirectILKernelGenerator.ShiftScalarKernel<T>
Delegate for shift operation with scalar shift amount. This is the SIMD-optimized path for uniform shifts.
- DirectILKernelGenerator.StridedCastKernel
Cross-dtype strided/broadcast copy kernel.
- ElementReductionKernel
Delegate for element-wise reduction kernels. Reduces all elements of an array to a single value.
- FilterAxisKernel
IL-emitted fused mask-driven gather. Reads src at each True position in mask and emits one slab per outer block into dst. The runtime innerSize is honoured by the "bulk" variant; the typed variants (1/2/4/8/16-byte) ignore it (the size is baked into IL).
- IndicesKernel
IL-emitted slab filler for np.indices. Writes the multi-axis coordinate of each output position directly via blockwise SIMD memsets; no per-element divmod or coord advance.
- MatMul2DKernel<T>
Kernel delegate for 2D matrix multiplication: C = A * B, where A is [M x K], B is [K x N], and C is [M x N]. All matrices are row-major contiguous.
- MixedTypeKernel
Mixed-type binary operation kernel signature using void pointers. Handles operations where LHS, RHS, and result may have different types. Type conversion is handled internally by the generated IL.
- NonZeroPerDimKernel
IL-emitted coord expand for np.nonzero: converts a flat-index buffer (monotonic ascending C-order) into ndim separate per-dim coordinate columns via incremental coord advance, with no per-element divmod. Dtype-agnostic (operates on long*). outCols is a pointer to an array of ndim long* pointers, one per output dimension. outCols[d][i] receives the coordinate along dim d for the i'th non-zero element.
- PlaceKernel
IL-emitted mask-driven scatter for np.place. For each i in [0, maskSize) with mask[i] true, writes dst[i] = values[j % valuesCount] and advances j.
- PutKernel
IL-emitted scatter kernel for np.put. Writes dst[apply_mode(indices[i])] = values[i % valuesCount] for each i in [0, indicesCount).
- RavelMultiIndexKernel
IL-emitted multi-coord→flat-index folder for np.ravel_multi_index. Caller pre-casts each coord array to contiguous int64 and computes ravelStrides (C or F order baked into the strides); the kernel reads coords[d][i] linearly, applies the per-axis modes clipping/wrapping, and writes the summed flat index into outIndices.
- StridedUnaryKernel
Fused strided-source unary kernel: builds each SIMD vector directly from a strided 1-D source via lane-count scalar gathers, applies the unary op, and stores contiguously — single pass, no scratch buffer, no per-tile dispatch.
- TakeKernel
IL-emitted gather kernel for np.take. The source is treated as a 3-D layout (outerSize, maxItem, innerSize-bytes). For each (outer, j) pair the kernel reads indices[j], applies mode, and copies innerSize bytes from the source slab to the destination position.
- TraceKernel
Walks the diagonal of one or more 2-D sub-arrays and accumulates each into the promoted dtype. Per-src dtype kernel; accum / result dtypes are baked into the IL at emit time.
- TypedElementReductionKernel<TResult>
Delegate for typed element-wise reduction kernels. Returns the reduced value directly without boxing.
- UnaryKernel
Unary operation kernel signature using void pointers. Handles operations where input and output may have different types. Type conversion is handled internally by the generated IL.
- UnravelIndexKernel
IL-emitted flat→multi-coord expander for np.unravel_index. Caller pre-casts indices to int64 and computes unravelSize = product of dims; the kernel reads indices linearly, validates each value against [0, unravelSize), and emits per-axis coords into outCols using the C-order or F-order extraction direction selected by idxStart/idxStep.
- WhereKernel<T>
Delegate for where operation kernels.
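ArgwhereExpandKernel and NonZeroPerDimKernel above both describe turning ascending flat indices into coordinates by "incremental coord advance" rather than a divmod per element. One way such an advance could work is sketched below. It is a reference model under stated assumptions (C-order, ndim >= 1, indices ascending and in range), not the emitted IL, and expand_flat_indices is a hypothetical name.

```python
def expand_flat_indices(flat, shape):
    """Convert ascending C-order flat indices to coordinate rows.

    Keeps a running coordinate and moves it forward by the gap to the
    next index, carrying overflow into outer axes with subtraction
    instead of dividing every index by every dimension.
    """
    ndim = len(shape)
    coord = [0] * ndim
    prev = 0
    rows = []
    for idx in flat:
        # Advance the innermost axis by the distance since the last hit.
        coord[-1] += idx - prev
        prev = idx

        # Propagate carries outward; stops as soon as an axis is in range.
        d = ndim - 1
        while d > 0 and coord[d] >= shape[d]:
            while coord[d] >= shape[d]:
                coord[d] -= shape[d]
                coord[d - 1] += 1
            d -= 1

        rows.append(tuple(coord))   # one (ndim)-wide row per flat index
    return rows
```

Writing each row contiguously gives the (count, ndim) argwhere layout; writing coord[d] into a separate column per axis instead gives the per-dimension layout described for np.nonzero.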