
Namespace NumSharp.Backends.Iteration

Classes

ArrayNode
BinaryNode
CallNode
ComparisonNode
ConstNode
DelegateSlots
InputNode
MinMaxNode
NDAxisIter
NDExpr

Abstract expression node. Subclasses describe computations over NDIter operands; Compile() produces an NDInnerLoopFunc.

NDExprCompileContext
NDFlatIterator

Flat (1-D, C-order) element iterator — the NumSharp analog of NumPy's flatiter, used by iters.

It wraps an operand that has already been broadcast to the result shape (via broadcast_to(NDArray, Shape)). It yields each logical element in C order, expanding stride-0 (broadcast) dimensions, exactly like NumPy's broadcast.iters[i]:

// numpy: np.broadcast([1,2,3], [[10],[20]]).iters[0] -> 1,2,3,1,2,3
//                                            .iters[1] -> 10,10,10,20,20,20

The broadcast expansion is the same Shape/stride machinery NDIter uses; element access resolves the (possibly stride-0) coordinates per step, so no buffer is materialized.
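The following pure-Python sketch shows how stride-0 expansion produces that flat sequence without materializing a buffer. The helper names broadcast_strides and flat_iter are hypothetical and not NumSharp API. Strides are counted in elements rather than bytes because the backing store is a plain list.

```python
# Hypothetical helpers (not NumSharp API). Strides are in elements, not bytes,
# because the backing store here is a plain Python list.

def broadcast_strides(shape, strides, out_shape):
    """Strides of `shape` broadcast to `out_shape`; broadcast axes get stride 0."""
    pad = len(out_shape) - len(shape)
    shape = (1,) * pad + tuple(shape)
    strides = (0,) * pad + tuple(strides)
    result = []
    for dim, stride, out_dim in zip(shape, strides, out_shape):
        if dim == out_dim:
            result.append(stride)
        elif dim == 1:
            result.append(0)  # every step along this axis re-reads the same element
        else:
            raise ValueError("shapes are not broadcast-compatible")
    return tuple(result)


def flat_iter(buffer, offset, strides, out_shape):
    """Yield elements in C order, resolving each multi-index through the strides."""
    if any(dim == 0 for dim in out_shape):
        return
    index = [0] * len(out_shape)
    while True:
        yield buffer[offset + sum(i * s for i, s in zip(index, strides))]
        # Odometer increment: the last axis varies fastest (C order).
        axis = len(out_shape) - 1
        while axis >= 0:
            index[axis] += 1
            if index[axis] < out_shape[axis]:
                break
            index[axis] = 0
            axis -= 1
        if axis < 0:
            return
```

For the np.broadcast example above, broadcast_strides((3,), (1,), (2, 3)) gives (0, 1), and flat_iter over [1, 2, 3] with those strides yields 1, 2, 3, 1, 2, 3.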

NDIter

Static iterator helper methods (backward-compatible API).

NUMSHARP DIVERGENCE: These methods support unlimited dimensions via dynamic allocation. Dimension arrays are allocated on demand and freed after use.

NDIterBufferManager

Buffer management for NDIter. Handles allocation, copy-in, and copy-out of iteration buffers.

NDIterCasting

Type casting utilities for NDIter. Validates casting rules and performs type conversions.

NDIterCoalescing

Axis coalescing logic for NDIter. Merges adjacent compatible axes to reduce iteration overhead.

NUMSHARP DIVERGENCE: This implementation supports unlimited dimensions. Stride arrays are indexed via StridesNDim and allocated according to the actual ndim.
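The sketch below illustrates the general axis-coalescing technique. It does not reproduce NumSharp's code; the function name coalesce and its list-based layout are assumptions. An outer axis can absorb the next inner axis when, for every operand, one step of the outer index moves the pointer as far as a full sweep of the inner axis. Size-1 axes never move the pointer, so the sketch drops them first. NumPy's own rule handles them inline instead.

```python
def coalesce(shape, op_strides):
    """Merge adjacent axes that every operand can walk as one longer axis.

    shape: axis lengths in C order (outermost first).
    op_strides: one stride list per operand, in the same axis order.
    Returns (new_shape, new_op_strides).
    """
    # Size-1 axes never move the data pointer, so drop them up front.
    keep = [i for i, n in enumerate(shape) if n != 1]
    shape = [shape[i] for i in keep]
    strides = [[s[i] for i in keep] for s in op_strides]
    if not shape:
        return [1], [[0] for _ in op_strides]

    out_shape = [shape[0]]
    out_strides = [[s[0]] for s in strides]
    for axis in range(1, len(shape)):
        n = shape[axis]
        # Mergeable only if outer_stride == inner_stride * inner_len for ALL operands.
        if all(o[-1] == s[axis] * n for o, s in zip(out_strides, strides)):
            out_shape[-1] *= n
            for o, s in zip(out_strides, strides):
                o[-1] = s[axis]  # the merged axis steps with the inner stride
        else:
            out_shape.append(n)
            for o, s in zip(out_strides, strides):
                o.append(s[axis])
    return out_shape, out_strides
```

For example, a C-contiguous float64 array of shape (2, 3, 4) has byte strides (96, 32, 8) and collapses to a single axis of 24 elements with stride 8. Pairing it with an operand whose row is broadcast (stride 0 on the outer axis) blocks the merge, because that operand cannot be walked as one flat run.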

NDIterConstants

NDIter-related bit-packing constants that don't belong on the flag enums.

NDIterExecution

Execution helpers for the different execution paths (see NDIterExecutionPath).

NDIterFlagMasks

Bit masks that partition the NDIter flag space into global (bits 0-15) and per-operand (bits 16-31) regions. Matches NumPy's NPY_ITER_GLOBAL_FLAGS and NPY_ITER_PER_OP_FLAGS macros.
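A minimal sketch of how the two masks split a combined flag word. The mask and flag values below are NumPy's definitions from ndarraytypes.h, reproduced for illustration. The helper split_flags is hypothetical.

```python
# Values as defined by NumPy (numpy/_core/include/numpy/ndarraytypes.h).
NPY_ITER_GLOBAL_FLAGS = 0x0000FFFF   # bits 0-15: iterator-wide options
NPY_ITER_PER_OP_FLAGS = 0xFFFF0000   # bits 16-31: per-operand options

NPY_ITER_MULTI_INDEX = 0x00000004    # a global flag
NPY_ITER_READONLY = 0x00020000       # a per-operand flag


def split_flags(flags):
    """Hypothetical helper: return (global_part, per_operand_part)."""
    return flags & NPY_ITER_GLOBAL_FLAGS, flags & NPY_ITER_PER_OP_FLAGS
```

Because the two regions never overlap, a flag passed in the wrong slot (for example a per-operand flag in the global argument) can be detected with a single mask test.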

NDIterPathSelector

Execution path selection logic.

NDIterUtils

Helper utilities for NDIter op_axes encoding/decoding.

ReduceNode
UnaryNode
WhereNode

Structs

ComplexAllKernel
ComplexAnyKernel
ComplexArgAccumulator

ArgMin/ArgMax accumulator for Complex — lexicographic best plus the same running-index / NaN-index bookkeeping as HalfArgAccumulator.

ComplexArgMaxKernel
ComplexArgMinKernel
ComplexMaxKernel

Complex max via lexicographic (real, then imaginary) compare. On the first NaN-bearing element the kernel stores it verbatim and aborts (see ComplexMinMaxAccumulator).

ComplexMinKernel
ComplexMinMaxAccumulator

Min/Max accumulator for Complex: the running lexicographic extremum, a "seen any value" flag, and a NaN flag. On the first element whose real OR imaginary part is NaN the kernel stores that element VERBATIM in Best and aborts — matching NumPy's minimum/maximum, which return the NaN-bearing operand as-is (e.g. min([1+1j, nan+0j]) → (nan,0), not (nan,nan)). Because the iterator runs in NPY_KEEPORDER, the element captured is the first NaN in MEMORY order, which is exactly what NumPy's reduce returns for a non-C-contiguous (e.g. transposed) array.
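A pure-Python sketch of the min semantics described above: a lexicographic comparison, and a stop at the first NaN-bearing element, which is returned unchanged. The function names are hypothetical. The input sequence stands in for elements already visited in memory order.

```python
import math


def has_nan(z):
    """True if either component of the complex value is NaN."""
    return math.isnan(z.real) or math.isnan(z.imag)


def lex_less(a, b):
    """Lexicographic order: compare real parts, break ties on imaginary parts."""
    return a.real < b.real or (a.real == b.real and a.imag < b.imag)


def complex_min(values):
    """Sketch of ComplexMinMaxAccumulator semantics for min (hypothetical helper)."""
    best = None
    for z in values:  # assumed to arrive in memory order (NPY_KEEPORDER)
        if has_nan(z):
            return z  # first NaN-bearing element, returned verbatim; abort
        if best is None or lex_less(z, best):
            best = z
    if best is None:
        raise ValueError("zero-size array has no minimum")
    return best
```

With [1+1j, complex(nan, 0)] the result is the NaN element itself, so its imaginary part stays 0 rather than becoming NaN.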

ComplexProdKernel

Complex product. The cross-term multiply (a+bi)(c+di) cannot be expressed as an independent-lane SIMD reduction, so this is a scalar fold; the win over the delegate path is devirtualization + a register-held accumulator.

ComplexSumKernel

Complex sum. When the inner loop is contiguous (stride == 16 bytes, i.e. one Complex) and Vector256 is hardware-accelerated, the chunk is summed as a flat stream of doubles using two Vector256<double> lanes. Real and imaginary parts stay interleaved in fixed lane positions, so they remain separable after the lane reduction. The tail is then added as scalars, and non-contiguous chunks are summed scalar. The SIMD reassociation differs from a strict left fold only at ULP level (the same class of difference as the codebase's pairwise reductions).
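The following sketch simulates the interleaved-lane idea in pure Python. It reads "two Vector256<double> lanes" as two 4-wide vector accumulators; that reading is an assumption. A Vector256<double> holds four doubles, i.e. two Complex values laid out as re, im, re, im. Even lanes therefore always collect real parts and odd lanes imaginary parts. The function name is hypothetical.

```python
def simd_style_complex_sum(values, width=4):
    """Simulate summing Complex values as an interleaved double stream."""
    flat = []
    for z in values:
        flat.extend((z.real, z.imag))  # memory layout: re0, im0, re1, im1, ...

    acc0 = [0.0] * width  # simulated vector accumulator #1
    acc1 = [0.0] * width  # simulated vector accumulator #2
    step = 2 * width
    i = 0
    while i + step <= len(flat):
        for lane in range(width):
            acc0[lane] += flat[i + lane]
            acc1[lane] += flat[i + width + lane]
        i += step

    # Lane reduction: width is even, so even lanes hold real parts
    # and odd lanes hold imaginary parts.
    lanes = [a + b for a, b in zip(acc0, acc1)]
    real = sum(lanes[0::2])
    imag = sum(lanes[1::2])

    # Scalar tail for the Complex values that did not fill a full step.
    while i < len(flat):
        real += flat[i]
        imag += flat[i + 1]
        i += 2
    return complex(real, imag)
```

The order of additions differs from a left-to-right fold, which is where the ULP-level differences mentioned above come from. With integer-valued inputs the result is exact.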

CountNonZeroKernel<T>
CumProdAxisKernel<T>
CumSumAxisKernel<T>
HalfAllKernel
HalfAnyKernel
HalfArgAccumulator

ArgMin/ArgMax accumulator for Half. Cur is the running C-order flat index of the chunk's first element (callers MUST use NPY_CORDER). BestIdx starts at -1 as the "no value yet" sentinel. SawNaNIdx is the flat index of the first NaN (NumPy: argmin/argmax of an array containing NaN return the first NaN's index); -1 until a NaN is seen.
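A pure-Python sketch of these argmin rules. The index is the running C-order position, -1 marks "no value yet", and the first NaN's index wins. The function name is hypothetical. The sketch returns as soon as it finds a NaN, which gives the same answer as recording SawNaNIdx and continuing. The strict < keeps the first occurrence of a repeated minimum, as NumPy does.

```python
import math


def argmin_with_nan(values):
    """values: elements in C order. Returns the flat index of the minimum,
    or the index of the first NaN if any NaN is present."""
    best_idx = -1  # -1 = "no value yet" sentinel
    best = None
    for cur, v in enumerate(values):  # cur = running C-order flat index
        if math.isnan(v):
            return cur  # first NaN decides the result
        if best_idx == -1 or v < best:  # strict <: ties keep the earlier index
            best, best_idx = v, cur
    if best_idx == -1:
        raise ValueError("attempt to get argmin of an empty sequence")
    return best_idx
```
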

HalfArgMaxKernel
HalfArgMinKernel
HalfMaxKernel

Half max. Contiguous chunks (stride == 2) run a 4-accumulator unroll that breaks the per-element dependency chain (~1.3× the scalar fold); other strides take the scalar branch. NaN propagates: the kernel aborts the moment a NaN is seen so the caller returns Half.NaN.
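Below is a sketch of the 4-accumulator layout, with an abort on the first NaN. In Python it only illustrates the structure: the speedup comes from the JIT-compiled kernel, where four independent accumulators let comparisons overlap instead of waiting on one running maximum. The function name is hypothetical. A min kernel would seed with +inf and flip the comparison.

```python
import math


def unrolled_max(values):
    """Max over a contiguous run using four independent accumulators.
    Returns NaN as soon as a NaN is seen."""
    n = len(values)
    if n == 0:
        raise ValueError("zero-size array has no maximum")
    acc = [-math.inf] * 4  # unroll seeds
    i = 0
    while i + 4 <= n:
        for k in range(4):  # four independent dependency chains
            v = values[i + k]
            if math.isnan(v):
                return math.nan
            if v > acc[k]:
                acc[k] = v
        i += 4
    best = max(acc)  # combine the accumulators
    for v in values[i:]:  # scalar tail
        if math.isnan(v):
            return math.nan
        if v > best:
            best = v
    return best
```
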

HalfMinKernel

Half min: mirror of HalfMaxKernel, with the comparison inverted and the unroll accumulators seeded with +inf instead of -inf.

HalfMinMaxAccumulator

Min/Max accumulator for Half: running extremum (held in double, the precision the codebase's f16 reductions already use), a "seen any value" flag for the empty/first-element guard, and a NaN flag. Any NaN ⇒ result NaN (NumPy: min/max with NaN propagates), so the kernel aborts on the first NaN it sees.

HalfProdKernel
HalfSumKernel
NDAllKernel<T>
NDAnyKernel<T>
NDAxisState
NDIterRef

High-performance multi-operand iterator matching NumPy's nditer API.

NDIterState

Core iterator state with dynamically allocated arrays for both dimensions and operands.

NUMSHARP DIVERGENCE: Unlike NumPy's fixed NPY_MAXDIMS=64 and NPY_MAXARGS=64, NumSharp supports unlimited dimensions AND unlimited operands. All arrays are allocated dynamically based on actual NDim and NOp values.

NDMaxAxisKernel<T>

Max reduction kernel for axis operations.

NDMinAxisKernel<T>

Min reduction kernel for axis operations.

NDProdAxisKernel<T>

Product reduction kernel for axis operations.

NDSumAxisKernel<T>

Sum reduction kernel for axis operations.

NanMaxDoubleKernel
NanMaxFloatKernel
NanMaxHalfKernel
NanMeanAccumulator

Accumulator for nanmean: running sum and count of non-NaN elements.
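A minimal sketch of how the two accumulator fields (sum and count) are threaded through successive chunks and combined at the end. The function name is hypothetical. As in NumPy's nanmean, an all-NaN or empty input yields NaN.

```python
import math


def nanmean(chunks):
    """Fold a running (sum, count) over chunks, skipping NaN elements."""
    total, count = 0.0, 0  # the accumulator's two fields
    for chunk in chunks:  # each inner-loop call folds one chunk
        for v in chunk:
            if not math.isnan(v):
                total += v
                count += 1
    return total / count if count else math.nan
```
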

NanMeanComplexAccumulator

nanmean accumulator for Complex: running Complex sum and non-NaN count.

NanMeanComplexKernel
NanMeanDoubleKernel
NanMeanFloatKernel
NanMeanHalfKernel
NanMinDoubleKernel
NanMinFloatKernel
NanMinHalfKernel
NanMinMaxDoubleAccumulator

Accumulator for NanMin/NanMax on double arrays.

NanMinMaxFloatAccumulator

Accumulator for NanMin/NanMax: running extremum plus a flag indicating whether any non-NaN element has been seen. Returns NaN if all elements were NaN.
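The "seen" flag is needed because a seed value alone cannot tell an all-NaN input apart from one whose only valid element equals the seed (for example [inf, nan] for nanmin). A minimal sketch follows; the function name is hypothetical.

```python
import math


def nanmin(values):
    """Running minimum over non-NaN elements, plus a 'seen any' flag."""
    best = 0.0  # placeholder; only meaningful once `seen` is True
    seen = False
    for v in values:
        if math.isnan(v):
            continue  # NaNs are skipped, not propagated
        if not seen or v < best:
            best = v
        seen = True
    return best if seen else math.nan  # all-NaN (or empty) -> NaN
```
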

NanProdDoubleKernel
NanProdFloatKernel
NanProdHalfKernel
NanSquaredDeviationComplexKernel
NanSquaredDeviationDoubleKernel
NanSquaredDeviationFloatKernel
NanSquaredDeviationHalfKernel
NanSumComplexKernel
NanSumDoubleKernel
NanSumFloatKernel
NanSumHalfKernel
StdAxisDoubleKernel
VarAxisDoubleKernel

Interfaces

INDAxisDoubleReductionKernel
INDAxisNumericReductionKernel<T>

Generic numeric axis reduction kernel interface. Used by NDAxisIter for sum, prod, min, max along an axis.

INDAxisSameTypeKernel<T>
INDBooleanReductionKernel<T>
INDInnerLoop

Struct-generic inner loop: a zero-allocation alternative to NDInnerLoopFunc. Implementations should be readonly structs; the JIT specializes ExecuteGeneric<TKernel>(TKernel) for each kernel type and inlines the call.

INDIterKernel

Interface for kernels that work with NDIter.

INDReducingInnerLoop<TAccum>

Reduction variant — the accumulator is threaded through the outer loop so each inner-loop invocation can accumulate into the same scalar. Return false to abort iteration (early exit for Any/All).
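A sketch of the control flow: the outer loop hands each chunk and the shared accumulator to the inner loop and stops when the inner loop reports False. The names are hypothetical. A C# kernel would update the accumulator by reference; Python returns a (keep_going, accum) pair instead.

```python
def iterate_reduce(chunks, inner_loop, accum):
    """Outer loop: thread one accumulator through every inner-loop call."""
    for chunk in chunks:
        keep_going, accum = inner_loop(chunk, accum)
        if not keep_going:
            break  # early exit: later chunks are never visited
    return accum


def any_inner_loop(chunk, accum):
    """Any-style kernel: the first nonzero element decides the result."""
    for v in chunk:
        if v != 0:
            return False, True  # result is known, abort iteration
    return True, accum
```

An All kernel mirrors this: it starts from True and aborts on the first zero.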

Enums

MemOverlap

Result of a memory-overlap query. Matches NumPy's mem_overlap_t (mem_overlap.h:15-21).

NDArrayMethodFlags

Flags characterizing the transfer (cast/copy) functions set up by an iterator. Matches NumPy's NPY_ARRAYMETHOD_FLAGS (dtype_api.h:66).

Packed into the top 8 bits of ItFlags at offset TRANSFERFLAGS_SHIFT (=24). Retrieve them via GetTransferFlags(); this is the preferred way to check whether the iteration can run without the GIL (in NumPy) or may set floating-point errors.
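A minimal sketch of the packing. The helper names are hypothetical. The two flag values are NumPy's NPY_METH_REQUIRES_PYAPI (1 << 0) and NPY_METH_NO_FLOATINGPOINT_ERRORS (1 << 1), the runtime flags an iterator records. Storing them in the top byte leaves bits 0-23 of ItFlags untouched.

```python
TRANSFERFLAGS_SHIFT = 24
TRANSFERFLAGS_MASK = 0xFF << TRANSFERFLAGS_SHIFT  # top 8 bits of a 32-bit word

NPY_METH_REQUIRES_PYAPI = 1 << 0
NPY_METH_NO_FLOATINGPOINT_ERRORS = 1 << 1


def set_transfer_flags(itflags, method_flags):
    """Hypothetical helper: replace the top byte, keep the other 24 bits."""
    cleared = itflags & ~TRANSFERFLAGS_MASK
    return cleared | ((method_flags & 0xFF) << TRANSFERFLAGS_SHIFT)


def get_transfer_flags(itflags):
    """Hypothetical helper: read the method flags back out of the top byte."""
    return (itflags >> TRANSFERFLAGS_SHIFT) & 0xFF
```

A caller would then test, for example, get_transfer_flags(flags) & NPY_METH_REQUIRES_PYAPI to decide whether the GIL is needed.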

NDExprReduceKind

Reduction kinds supported by ReduceNode.

NDIterExecutionPath

Execution path for NDIter operations.

NDIterFlags

Iterator-level flags. Conceptually matches NumPy's NPY_ITFLAG_* constants.

NOTE: Bit positions differ from NumPy's implementation:

  • NumPy uses bits 0-7 for IDENTPERM, NEGPERM, HASINDEX, etc.
  • NumSharp reserves bits 0-7 for legacy compatibility flags (SourceBroadcast, SourceContiguous, DestinationContiguous)
  • NumPy-equivalent flags are shifted to bits 8-15

This layout maintains backward compatibility with existing NumSharp code while adding NumPy parity flags. The semantic meaning of each flag matches NumPy, only the bit positions differ.

NDIterGlobalFlags

Global flags passed to iterator construction. Bit values match NumPy's NPY_ITER_* constants exactly (see numpy/_core/include/numpy/ndarraytypes.h).

NDIterOpFlags

Per-operand flags during iteration. Matches NumPy's NPY_OP_ITFLAG_* constants.

NDIterPerOpFlags

Per-operand flags passed to iterator construction. Bit values match NumPy's NPY_ITER_* per-operand constants exactly (see numpy/_core/include/numpy/ndarraytypes.h). All values occupy the high 16 bits per NumPy's NPY_ITER_PER_OP_FLAGS mask (0xffff0000).

NPY_CASTING

Casting rules enumeration matching NumPy's NPY_CASTING.

NPY_ORDER

Iteration order enumeration matching NumPy's NPY_ORDER.

Delegates

NDInnerLoopFunc

Inner-loop callback matching NumPy's PyUFuncGenericFunction. Invoked once per outer iteration; processes count elements starting at dataptrs[op] with per-operand byte stride strides[op].
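The sketch below models this calling convention in pure Python. Each operand is a bytearray plus a byte offset standing in for dataptrs[op], and every operand advances by its own byte stride. The float64 add kernel and its name are hypothetical. A stride of 0 makes an operand behave as a broadcast scalar.

```python
import struct


def add_inner_loop(buffers, dataptrs, strides, count):
    """Hypothetical float64 add kernel: operands 0 and 1 are inputs,
    operand 2 is the output. Offsets and strides are in bytes."""
    a, b, out = buffers
    pa, pb, po = dataptrs
    sa, sb, so = strides
    for _ in range(count):
        x = struct.unpack_from("<d", a, pa)[0]
        y = struct.unpack_from("<d", b, pb)[0]
        struct.pack_into("<d", out, po, x + y)
        # Advance each operand by its own byte stride.
        pa += sa
        pb += sb
        po += so
```

Calling it with strides [8, 0, 8] adds a broadcast scalar to a contiguous run. A stride of 16 on the first operand would read every other float64.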

NDIterGetMultiIndexFunc

Function to get the multi-index at the current position.

NDIterInnerLoopFunc

Inner-loop kernel called by the iterator.

NDIterNextFunc

Function to advance the iterator to the next position. Returns true if more iterations remain.