Namespace NumSharp.Backends.Iteration

Classes

NDExpr: Abstract expression node. Subclasses describe computations over NDIter operands; Compile() produces an NDInnerLoopFunc.

NDExprCompileContext

Flat (1-D, C-order) element iterator — the NumSharp analog of NumPy's flatiter, used by iters.

It wraps an operand that has already been broadcast to the result shape (via broadcast_to(NDArray, Shape)) and yields each logical element in C-order, expanding stride-0 (broadcast) dimensions, exactly like NumPy's broadcast.iters[i]:

// numpy: np.broadcast([1,2,3], [[10],[20]]).iters[0] -> 1,2,3,1,2,3
//                                            .iters[1] -> 10,10,10,20,20,20

The broadcast expansion is the same Shape/stride machinery NDIter uses; element access resolves the (possibly stride-0) coordinates per step, so no buffer is materialized.
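The per-step coordinate resolution can be sketched in plain Python. This is an illustration only, not NumSharp's code: `flat_iter`, its element-unit strides, and the list-backed storage are assumptions made for the example.

```python
from itertools import product

def flat_iter(data, shape, strides, offset=0):
    """Yield the elements of a strided view in C order (hypothetical helper).

    Strides are in elements, not bytes; a stride of 0 marks a broadcast axis.
    """
    for coords in product(*(range(n) for n in shape)):
        # Resolve the storage index for this coordinate; nothing is copied.
        yield data[offset + sum(c * s for c, s in zip(coords, strides))]

# [1,2,3] broadcast to shape (2,3): axis 0 has stride 0.
row = list(flat_iter([1, 2, 3], (2, 3), (0, 1)))
# [[10],[20]] broadcast to shape (2,3): axis 1 has stride 0.
col = list(flat_iter([10, 20], (2, 3), (1, 0)))
```

Here `row` and `col` reproduce the two sequences in the NumPy comment, while the backing lists keep their original three and two elements.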

NDIter

Static iterator helper methods (backward-compatible API).

NUMSHARP DIVERGENCE: These methods support unlimited dimensions via dynamic allocation. Dimension arrays are allocated on demand and freed after use.

NDIterBufferManager: Buffer management for NDIter. Handles allocation, copy-in, and copy-out of iteration buffers.

NDIterCasting: Type casting utilities for NDIter. Validates casting rules and performs type conversions.

NDIterCoalescing

Axis coalescing logic for NDIter. Merges adjacent compatible axes to reduce iteration overhead.

NUMSHARP DIVERGENCE: This implementation supports unlimited dimensions. Uses StridesNDim for stride array indexing (allocated based on actual ndim).
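As a rough sketch of what "adjacent compatible axes" can mean, the Python below merges axes of a single operand. The function name, the outermost-first ordering, and the merge rule are assumptions for illustration; a real multi-operand iterator would have to check the rule for every operand before merging.

```python
def coalesce_axes(shape, strides):
    """Merge adjacent axes of one operand (outermost axis first).

    Axes i and i+1 merge when one step along axis i equals a full walk of
    axis i+1, i.e. strides[i] == shape[i+1] * strides[i+1].
    """
    out_shape, out_strides = [], []
    for n, s in zip(shape, strides):
        if n == 1 and out_shape:
            continue  # a length-1 axis never advances, so drop it
        if out_shape and out_strides[-1] == n * s:
            # Fold this axis into the previous one.
            out_shape[-1] *= n
            out_strides[-1] = s
        else:
            out_shape.append(n)
            out_strides.append(s)
    return out_shape, out_strides

contiguous = coalesce_axes((2, 3, 4), (12, 4, 1))   # collapses to one axis
transposed = coalesce_axes((3, 2), (1, 3))          # cannot be merged
```

A C-contiguous operand collapses to a single axis, so the iterator runs one long inner loop instead of nested short ones.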

NDIterConstants: NDIter-related bit-packing constants that don't belong on the flag enums.

NDIterExecution: Execution helpers for different paths.

NDIterFlagMasks: Bit masks that partition the NDIter flag space into global (bits 0-15) and per-operand (bits 16-31) regions. Matches NumPy's NPY_ITER_GLOBAL_FLAGS and NPY_ITER_PER_OP_FLAGS macros.
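The partition can be illustrated with two masks and a split function. The mask values follow from the bit ranges stated above; the two sample flag values are NumPy's as recalled from ndarraytypes.h and should be treated as illustrative rather than as NumSharp's definitions.

```python
GLOBAL_MASK = 0x0000FFFF   # bits 0-15
PER_OP_MASK = 0xFFFF0000   # bits 16-31

# Sample NumPy flag values (quoted from memory, for illustration only).
NPY_ITER_EXTERNAL_LOOP = 0x00000008   # a global flag
NPY_ITER_READONLY = 0x00020000        # a per-operand flag

def split_flags(flags):
    """Return the (global, per-operand) parts of a combined flag word."""
    return flags & GLOBAL_MASK, flags & PER_OP_MASK

global_part, per_op_part = split_flags(NPY_ITER_EXTERNAL_LOOP | NPY_ITER_READONLY)
```

Because the masks are disjoint and together cover all 32 bits, a flag that lands in the wrong region can be rejected with a single AND.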

NDIterPathSelector: Execution path selection logic.

NDIterUtils: Helper utilities for NDIter op_axes encoding/decoding.

ReduceNode

UnaryNode

WhereNode

Structs

ComplexAllKernel

ComplexAnyKernel

ComplexArgAccumulator: ArgMin/ArgMax accumulator for Complex — lexicographic best plus the same running-index / NaN-index bookkeeping as HalfArgAccumulator.

ComplexArgMaxKernel

ComplexArgMinKernel

ComplexMaxKernel: Complex max via lexicographic (real, then imaginary) compare. On the first NaN-bearing element the kernel stores it verbatim and aborts (see ComplexMinMaxAccumulator).

ComplexMinKernel

ComplexMinMaxAccumulator: Min/Max accumulator for Complex: the running lexicographic extremum, a "seen any value" flag, and a NaN flag. On the first element whose real OR imaginary part is NaN the kernel stores that element VERBATIM in Best and aborts — matching NumPy's minimum/maximum, which return the NaN-bearing operand as-is (e.g. min([1+1j, nan+0j]) → (nan,0), not (nan,nan)). Because the iterator runs in NPY_KEEPORDER, the element captured is the first NaN in MEMORY order, which is exactly what NumPy's reduce returns for a non-C-contiguous (e.g. transposed) array.
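The accumulator's rule can be sketched in a few lines of Python. `complex_min` is a hypothetical stand-in, not the kernel itself; it visits elements in the order given, which corresponds to whatever order the iterator delivers them in.

```python
import math

def complex_min(values):
    """Lexicographic min; the first NaN-bearing element is returned as-is."""
    best = None  # None plays the role of the "seen any value" flag
    for z in values:
        if math.isnan(z.real) or math.isnan(z.imag):
            return z  # store verbatim and abort
        if best is None or (z.real, z.imag) < (best.real, best.imag):
            best = z
    return best

picked = complex_min([1 + 1j, complex(math.nan, 0.0), 0j])
smallest = complex_min([2 + 1j, 1 + 5j, 1 + 2j])
```

`picked` keeps its zero imaginary part rather than becoming (nan, nan), and `smallest` shows the real-then-imaginary tie-break.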

ComplexProdKernel: Complex product. The cross-term multiply (a+bi)(c+di) cannot be expressed as an independent-lane SIMD reduction, so this is a scalar fold; the win over the delegate path is devirtualization plus a register-held accumulator.

ComplexSumKernel: Complex sum. When the inner loop is contiguous (stride == 16 bytes = one Complex) and Vector256 is hardware-accelerated, the chunk is summed as a flat double stream with two Vector256<double> lanes (the interleaved real/imaginary layout survives the lane reduction), and the tail is then added in scalar code. Non-contiguous chunks are summed in scalar code. The SIMD reassociation differs from a strict left fold only at ULP level (same class as the codebase's pairwise reductions).
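Why the interleaved layout survives a lane-wise sum can be seen in a scalar emulation. The sketch below is an assumption-laden model, not the kernel: it uses one 4-wide accumulator to stand in for a Vector256<double>, and `complex_sum_interleaved` is a hypothetical name.

```python
def complex_sum_interleaved(flat, width=4):
    """Sum complex values stored as [re0, im0, re1, im1, ...].

    width must be even. Even lanes only ever see real parts and odd lanes
    only imaginary parts, so the two never mix.
    """
    lanes = [0.0] * width
    full = len(flat) - len(flat) % width
    for i in range(0, full, width):
        for k in range(width):            # emulates one vector add
            lanes[k] += flat[i + k]
    re = sum(lanes[0::2])                 # horizontal reduce of even lanes
    im = sum(lanes[1::2])                 # horizontal reduce of odd lanes
    for i in range(full, len(flat), 2):   # scalar tail
        re += flat[i]
        im += flat[i + 1]
    return complex(re, im)

values = [1 + 2j, 3 + 4j, 5 + 6j]
flat = [part for z in values for part in (z.real, z.imag)]
total = complex_sum_interleaved(flat)
```

The additions happen in a different order than a left-to-right fold, which is the source of the ULP-level differences mentioned above.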

CountNonZeroKernel<T>

CumProdAxisKernel<T>

CumSumAxisKernel<T>

HalfAllKernel

HalfAnyKernel

HalfArgAccumulator: ArgMin/ArgMax accumulator for Half. Cur is the running C-order flat index of the chunk's first element (callers MUST use NPY_CORDER). BestIdx starts at -1 as the "no value yet" sentinel. SawNaNIdx is the flat index of the first NaN (NumPy: argmin/argmax of an array containing NaN return the first NaN's index); -1 until a NaN is seen.
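The bookkeeping can be modeled as follows. The class and function names are hypothetical, and stopping at the first NaN is an assumption of this sketch; the field roles follow the description above.

```python
import math

class ArgMaxState:
    """Hypothetical mirror of the accumulator fields described above."""
    def __init__(self):
        self.cur = 0            # C-order flat index of the chunk's first element
        self.best = 0.0
        self.best_idx = -1      # -1: no value yet
        self.saw_nan_idx = -1   # -1: no NaN seen

def argmax_chunk(state, chunk):
    """Process one chunk; return False to stop iterating."""
    for k, x in enumerate(chunk):
        if math.isnan(x):
            state.saw_nan_idx = state.cur + k
            return False        # assumption: the first NaN settles the result
        if state.best_idx < 0 or x > state.best:
            state.best, state.best_idx = x, state.cur + k
    state.cur += len(chunk)     # advance to the next chunk's first element
    return True

def argmax(chunks):
    state = ArgMaxState()
    for chunk in chunks:
        if not argmax_chunk(state, chunk):
            break
    return state.saw_nan_idx if state.saw_nan_idx >= 0 else state.best_idx
```

The running index is only meaningful if chunks arrive in C order, which is why the description insists on NPY_CORDER.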

HalfArgMaxKernel

HalfArgMinKernel

HalfMaxKernel: Half max. Contiguous chunks (stride == 2) run a 4-accumulator unroll that breaks the per-element dependency chain (~1.3× the scalar fold); other strides take the scalar branch. NaN propagates: the kernel aborts the moment a NaN is seen so the caller returns Half.NaN.
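The shape of a 4-accumulator unroll is shown below in Python. This only illustrates the structure: Python gains nothing from it, and `max_unrolled` with its tuple return is a hypothetical stand-in for the kernel and its accumulator.

```python
import math

def max_unrolled(chunk):
    """Return (value, saw_nan) for one contiguous chunk."""
    a0 = a1 = a2 = a3 = -math.inf          # seeds; a min kernel would use +inf
    full = len(chunk) - len(chunk) % 4
    for i in range(0, full, 4):
        x0, x1, x2, x3 = chunk[i:i + 4]
        if x0 != x0 or x1 != x1 or x2 != x2 or x3 != x3:
            return math.nan, True          # abort: NaN propagates
        # Four independent updates; none waits on another's result.
        a0 = x0 if x0 > a0 else a0
        a1 = x1 if x1 > a1 else a1
        a2 = x2 if x2 > a2 else a2
        a3 = x3 if x3 > a3 else a3
    best = max(a0, a1, a2, a3)             # combine the accumulators
    for x in chunk[full:]:                 # scalar tail
        if x != x:
            return math.nan, True
        if x > best:
            best = x
    return best, False

value, saw_nan = max_unrolled([3.0, 9.0, 1.0, 4.0, 7.0, 2.0])
```

In compiled code the four accumulators let the CPU overlap compares that a single running maximum would serialize.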

HalfMinKernel: Half min — mirror of HalfMaxKernel with the comparison and the unroll seeds inverted (+inf).

HalfMinMaxAccumulator: Min/Max accumulator for Half: running extremum (held in double, the precision the codebase's f16 reductions already use), a "seen any value" flag for the empty/first-element guard, and a NaN flag. Any NaN ⇒ result NaN (NumPy: min/max with NaN propagates), so the kernel aborts on the first NaN it sees.

HalfProdKernel

HalfSumKernel

NDAllKernel<T>

NDAnyKernel<T>

NDAxisState

NDIterRef: High-performance multi-operand iterator matching NumPy's nditer API.

NDIterState

Core iterator state with dynamically allocated arrays for both dimensions and operands.

NUMSHARP DIVERGENCE: Unlike NumPy's fixed NPY_MAXDIMS=64 and NPY_MAXARGS=64, NumSharp supports unlimited dimensions AND unlimited operands. All arrays are allocated dynamically based on actual NDim and NOp values.

NDMaxAxisKernel<T>: Max reduction kernel for axis operations.

NDMinAxisKernel<T>: Min reduction kernel for axis operations.

NDProdAxisKernel<T>: Product reduction kernel for axis operations.

NDSumAxisKernel<T>: Sum reduction kernel for axis operations.

NanMaxDoubleKernel

NanMaxFloatKernel

NanMaxHalfKernel

NanMeanAccumulator: Accumulator for nanmean: running sum and count of non-NaN elements.
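A minimal sketch of the sum-and-count scheme, with the chunked input standing in for successive inner-loop calls. The function name is hypothetical, and returning NaN when no non-NaN element exists is an assumption modeled on numpy.nanmean.

```python
import math

def nanmean_chunks(chunks):
    total, count = 0.0, 0           # the accumulator: running sum and count
    for chunk in chunks:
        for x in chunk:
            if not math.isnan(x):   # NaN elements are skipped, not counted
                total += x
                count += 1
    # Assumption: no non-NaN elements yields NaN, as numpy.nanmean does.
    return total / count if count else math.nan

mean = nanmean_chunks([[1.0, math.nan], [3.0]])
```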

NanMeanComplexAccumulator: nanmean accumulator for Complex: running Complex sum and non-NaN count.

NanMeanComplexKernel

NanMeanDoubleKernel

NanMeanFloatKernel

NanMeanHalfKernel

NanMinDoubleKernel

NanMinFloatKernel

NanMinHalfKernel

NanMinMaxDoubleAccumulator: Accumulator for NanMin/NanMax on double arrays.

NanMinMaxFloatAccumulator: Accumulator for NanMin/NanMax: running extremum plus a flag indicating whether any non-NaN element has been seen. Returns NaN if all elements were NaN.

NanProdDoubleKernel

NanProdFloatKernel

NanProdHalfKernel

NanSquaredDeviationComplexKernel

NanSquaredDeviationDoubleKernel

NanSquaredDeviationFloatKernel

NanSquaredDeviationHalfKernel

NanSumComplexKernel

NanSumDoubleKernel

NanSumFloatKernel

NanSumHalfKernel

StdAxisDoubleKernel

VarAxisDoubleKernel

Interfaces

INDAxisDoubleReductionKernel

INDAxisNumericReductionKernel<T>: Generic numeric axis reduction kernel interface. Used by NDAxisIter for sum, prod, min, max along an axis.

INDAxisSameTypeKernel<T>

INDBooleanReductionKernel<T>

INDInnerLoop: Struct-generic inner loop, a zero-alloc alternative to NDInnerLoopFunc. Implementations should be readonly structs; the JIT specializes ExecuteGeneric<TKernel>(TKernel) per type and inlines the call.

INDIterKernel: Interface for kernels that work with NDIter.

INDReducingInnerLoop<TAccum>: Reduction variant — the accumulator is threaded through the outer loop so each inner-loop invocation can accumulate into the same scalar. Return false to abort iteration (early exit for Any/All).
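The control flow can be sketched as a driver that passes one accumulator to every inner call and stops when the kernel returns false. All names here are hypothetical, and a one-element list stands in for a by-reference accumulator.

```python
class AnyKernel:
    """Hypothetical kernel; the accumulator is a one-element list of bool."""
    def run(self, chunk, acc):
        for x in chunk:
            if x:
                acc[0] = True
                return False    # abort: the answer cannot change
        return True             # keep iterating

def execute_reducing(kernel, chunks, acc):
    """Outer loop: thread the same accumulator through every inner call."""
    visited = 0
    for chunk in chunks:
        visited += 1
        if not kernel.run(chunk, acc):
            break
    return visited

acc = [False]
visited = execute_reducing(AnyKernel(), [[0, 0], [0, 1], [1, 1]], acc)
```

The third chunk is never touched, which is the early exit that Any/All rely on.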

Enums

MemOverlap: Result of a memory-overlap query. Matches NumPy's mem_overlap_t (mem_overlap.h:15-21).

NDArrayMethodFlags

Flags characterizing the transfer (cast/copy) functions set up by an iterator. Matches NumPy's NPY_ARRAYMETHOD_FLAGS (dtype_api.h:66).

Packed into the top 8 bits of ItFlags at offset TRANSFERFLAGS_SHIFT (=24). Retrieved via GetTransferFlags() — the preferred way to check whether the iteration can run without the GIL (in NumPy) or might set FP errors.
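The packing can be sketched with a shift and a mask. The shift value comes from the text above; the helper names are hypothetical, and the two flag values are NumPy's as recalled from dtype_api.h, quoted for illustration only.

```python
TRANSFERFLAGS_SHIFT = 24
TRANSFERFLAGS_MASK = 0xFF << TRANSFERFLAGS_SHIFT   # top 8 bits of a 32-bit word

# Sample NumPy values (from memory; treat as illustrative).
NPY_METH_REQUIRES_PYAPI = 1 << 0
NPY_METH_NO_FLOATINGPOINT_ERRORS = 1 << 1

def set_transfer_flags(itflags, transfer):
    """Store transfer flags in the top byte, leaving the low 24 bits alone."""
    return (itflags & ~TRANSFERFLAGS_MASK) | ((transfer & 0xFF) << TRANSFERFLAGS_SHIFT)

def get_transfer_flags(itflags):
    """Recover the transfer flags from the top byte."""
    return (itflags >> TRANSFERFLAGS_SHIFT) & 0xFF

itflags = set_transfer_flags(0x00000104, NPY_METH_NO_FLOATINGPOINT_ERRORS)
needs_pyapi = bool(get_transfer_flags(itflags) & NPY_METH_REQUIRES_PYAPI)
```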

NDExprReduceKind: Reduction kinds supported by ReduceNode.

NDIterExecutionPath: Execution path for NDIter operations.

NDIterFlags

Iterator-level flags. Conceptually matches NumPy's NPY_ITFLAG_* constants.

NOTE: Bit positions differ from NumPy's implementation:

NumPy uses bits 0-7 for IDENTPERM, NEGPERM, HASINDEX, etc.
NumSharp reserves bits 0-7 for legacy compatibility flags (SourceBroadcast, SourceContiguous, DestinationContiguous)
NumPy-equivalent flags are shifted to bits 8-15

This layout maintains backward compatibility with existing NumSharp code while adding NumPy parity flags. The semantic meaning of each flag matches NumPy; only the bit positions differ.

NDIterGlobalFlags: Global flags passed to iterator construction. Bit values match NumPy's NPY_ITER_* constants exactly (see numpy/_core/include/numpy/ndarraytypes.h).

NDIterOpFlags: Per-operand flags during iteration. Matches NumPy's NPY_OP_ITFLAG_* constants.

NDIterPerOpFlags: Per-operand flags passed to iterator construction. Bit values match NumPy's NPY_ITER_* per-operand constants exactly (see numpy/_core/include/numpy/ndarraytypes.h). All values occupy the high 16 bits per NumPy's NPY_ITER_PER_OP_FLAGS mask (0xffff0000).

NPY_CASTING: Casting rules enumeration matching NumPy's NPY_CASTING.

NPY_ORDER: Iteration order enumeration matching NumPy's NPY_ORDER.

Delegates

NDInnerLoopFunc: Inner-loop callback matching NumPy's PyUFuncGenericFunction. Invoked once per outer iteration; processes count elements starting at dataptrs[op] with per-operand byte stride strides[op].
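The calling convention can be modeled in Python with byte offsets into a single buffer playing the role of pointers. The function name and the shared-buffer setup are assumptions for the sketch; only the roles of dataptrs, strides, and count come from the description above.

```python
import struct

def add_inner_loop(buf, dataptrs, strides, count):
    """out[i] = a[i] + b[i] over float64 values addressed by byte offsets."""
    a, b, out = dataptrs
    sa, sb, so = strides
    for _ in range(count):
        (x,) = struct.unpack_from("<d", buf, a)
        (y,) = struct.unpack_from("<d", buf, b)
        struct.pack_into("<d", buf, out, x + y)
        # Advance each operand by its own byte stride.
        a, b, out = a + sa, b + sb, out + so

buf = bytearray(struct.pack("<7d", 1.0, 2.0, 3.0, 10.0, 0.0, 0.0, 0.0))
# a: three contiguous doubles at byte 0; b: one double at byte 24 with
# stride 0 (broadcast); out: three contiguous doubles at byte 32.
add_inner_loop(buf, dataptrs=[0, 24, 32], strides=[8, 0, 8], count=3)
result = list(struct.unpack_from("<3d", buf, 32))
```

Because each operand carries its own stride, the same loop body serves contiguous, strided, and broadcast operands without branching.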

NDIterGetMultiIndexFunc: Function to get multi-index at current position.

NDIterInnerLoopFunc: Inner loop kernel called by iterator.