NDIter — kerneling your NDArray with IL generation
NumPy's nditer is the unsung workhorse of NumPy. Every ufunc, every reduction, every broadcasted operation is scheduled by nditer under the covers. It decides which axes to iterate, which to coalesce, whether to buffer, how to walk strided memory — then it hands those decisions to a typed C inner loop generated from C++ templates.
NumSharp has to reach the same destination from the other direction. We have no templates. What we have is System.Reflection.Emit.DynamicMethod and a JIT that eagerly autovectorizes tight loops. This page explains how NumSharp's port of nditer (NDIter) works, why we diverge from NumPy in a few places, and — most importantly — how NDIter.Execution.cs glues the iterator to ILKernelGenerator so a single call like ExecuteBinary(Add) cashes out to the same kind of native SIMD loop that NumPy's C++ emits at compile time, but generated at your first call and cached forever after.
Read this page end-to-end if you're writing a new np.* function, porting a ufunc, or trying to squeeze more performance out of an existing operation.
Table of Contents
- Overview
- Public Iteration Surface
- What NDIter Is
- Divergences from NumPy
- Iterator State
- Construction
- Coalescing, Reordering, and Flipping
- Memory Overlap and COPY_IF_OVERLAP
- Iteration Mechanics
- Buffering
- Buffered Reduction: The Double Loop
- Kernel Integration Layer
- Quick reference
- Decision tree
- Measured behavior
- Cache state — two lifetimes to know about
- Layer 1 — Canonical Inner-Loop API
- Layer 2 — Struct-Generic Dispatch
- Layer 3 — Typed ufunc Dispatch
- Custom Operations (Tier 3A / 3B / 3C)
- Path Detection
- Worked Examples
- Performance
- Summary
Overview
What Is An Iterator?
An array is just a pointer plus a shape plus strides. Iterating "through" it means producing, one element (or chunk of elements) at a time, the byte offset into the buffer. For a contiguous row-major 3×4 array this is trivial — walk from 0 to 11 with stride 1. For a transposed view, a sliced view, a broadcasted view, or two arrays with mismatched strides, it is not.
NDIter takes that tangle and produces a single linear schedule of pointer advances. Once you have it, you can write one loop — do { kernel(dataptrs, strides, count); } while (iternext); — and it runs correctly for every memory layout NumSharp supports.
Why Build Our Own?
NumPy's nditer is C99 with templates mixed in through macro expansion. We can't take it verbatim. At the same time we want every one of its capabilities: coalescing, reordering, negative-stride flipping, ALLOCATE, COPY_IF_OVERLAP, buffered casting, buffered reduction with the double-loop trick, C/F/K ordering, per-operand flags, op_axes with explicit reduction encoding. These are features users rely on without realizing it — np.sum(a, axis=0) quietly benefits from four of them.
NumSharp implements all of it in managed code with NativeMemory.AllocZeroed for unmanaged state and ILKernelGenerator for the typed inner loops. The bridge that wires them together is NDIter.Execution.cs, which this page centers on.
Public Iteration Surface
NDIterRef is the kernel-author engine — the rest of this page is about it. Most application code never constructs one directly. The public, user-facing ways to walk an array are:
| You want | Use | Notes |
|---|---|---|
| Every element of one array, C-order | foreach (var x in ndarray) or ndarray.GetAtIndex(i) |
Resolves slices, strides, offset, and stride-0 broadcast; no intermediate copy. |
| Each operand of a broadcast, flattened | np.broadcast(a, b, …).iters[i] — a NDFlatIterator |
One flat C-order stream per operand, each stretched to the broadcast shape (e.g. np.broadcast([1,2,3], [[10],[20]]).iters[0] yields 1,2,3,1,2,3). |
| The broadcast itself, as tuples | foreach (object[] vals in np.broadcast(a, b, …)) |
NumPy's np.broadcast object: one per-operand value tuple per step, with a live .index cursor, .numiter, .size, and .reset(). |
There is no separate per-element iterator class — NDIter drives every kernel, and the element-level surface above is built on the same Shape/stride machinery (GetAtIndex). NDFlatIterator (NumSharp.Backends.Iteration) is the small public analog of NumPy's flatiter; it is re-enumerable, unlike NumPy's one-shot flatiters.
Like NumSharp's NDIter, np.broadcast(...) accepts any number of operands — NumPy caps the multi-iterator at 64 (NPY_MAXARGS); NumSharp does not (see Divergences from NumPy).
What NDIter Is
NDIter is a ref partial struct living in NumSharp.Backends.Iteration. Concretely:
NDIterRef (ref partial struct) ← public handle (~3000 lines across 2 partials)
├── _state: NDIterState* ← heap-allocated unmanaged state
├── _operands: NDArray[] ← kept alive by GC root
└── _cachedIterNext: NDIterNextFunc? ← memoized iterate-advance delegate
NDIterState (unmanaged struct) ← ~30 fields, all dynamically sized
├── Scalars: NDim, NOp, IterSize, IterIndex, ItFlags, ...
├── Dim arrays (size = NDim): Shape*, Coords*, Strides*, Perm*
├── Op arrays (size = NOp): DataPtrs*, ResetDataPtrs*, BufStrides*,
│ InnerStrides*, BaseOffsets*, OpDTypes*, ...
└── Reduction arrays: ReduceOuterStrides*, ReduceOuterPtrs*,
ArrayWritebackPtrs*, CoreSize, CorePos, ...
The public struct is cheap to pass around; the heavy state lives behind one pointer so we can allocate it exactly once, on the heap, sized to the problem. Dispose frees it.
The Files
| File | What lives there |
|---|---|
NDIter.cs |
Construction, iteration wrappers, debug dump, Copy, Dispose (~3000 lines) |
NDIter.State.cs |
NDIterState definition, allocation, Advance, Reset, GotoIterIndex, BufferedReduceAdvance |
NDIter.Execution.cs |
Kernel integration layer — ForEach, ExecuteGeneric, Execute{Binary,Unary,Reduction,Comparison,Scan,Copy} (~600 lines) |
NDIterFlags.cs |
NDIterFlags, NDIterOpFlags, NDIterGlobalFlags, NDIterPerOpFlags, casting/order enums |
NDIterCoalescing.cs |
CoalesceAxes, ReorderAxesForCoalescing, FlipNegativeStrides |
NDIterCasting.cs |
Safe/same-kind/unsafe cast rules, ConvertValue, FindCommonDtype |
NDIterBufferManager.cs |
Aligned buffer allocation, copy-in/copy-out, GROWINNER, BUF_REUSABLE |
NDIter.Reduce.cs |
NewReduce — builds the reduction iterator (stride-ordered op_axes, stride-0 output axis) |
NDIter.Execution.Custom.cs |
Custom-op tiers: ExecuteRawIL (3A), ExecuteElementWise (3B), ExecuteExpression (3C) |
NDMemOverlap.cs |
Diophantine memory-overlap solver for COPY_IF_OVERLAP (port of NumPy's mem_overlap.c) |
NDExpr.cs, NDExpr.Typing.cs, NDExpr.Evaluate.cs |
Tier 3C / np.evaluate expression tree, per-node NEP50 typing, lowering to one pass |
NDScalarReductionKernels.cs, NDNanReductionKernels.cs |
Half/Complex + NaN-aware scalar (axis=None) reduction kernel structs |
NDAxisIter.cs, NDAxisIter.State.cs |
Single-axis reduction iterator for var / std / cumsum / cumprod |
NDLogicalReductionKernels.cs |
All / Any reduction kernel structs |
Divergences from NumPy
NumPy's nditer has two hard-coded limits that NumSharp drops:
| Limit | NumPy | NumSharp |
|---|---|---|
NPY_MAXDIMS |
64 | unlimited (dynamic alloc, soft limit ≈ 300k from stackalloc) |
NPY_MAXARGS |
64 | unlimited (dynamic alloc) |
NumPy uses fixed arrays inside NDIter_InternalIterator. NumSharp allocates everything via NativeMemory.AllocZeroed sized to the actual (ndim, nop) the caller passes. The trade is marginally more setup cost in exchange for no artificial ceilings and no wasted memory on a 2-operand 1-D iter.
Other deliberate differences:
- Flag bit layout. NumSharp reserves low bits 0-7 for the compatibility flags (
SourceBroadcast,SourceContiguous,DestinationContiguous). NumPy-parity flags (IDENTPERM,HASINDEX,REDUCE, ...) sit at bits 8-15. Transfer flags pack into the top byte at shift 24. Semantics match NumPy; positions do not. - Element strides everywhere internally. NumPy stores byte strides in
NAD_STRIDES. NumSharp stores element strides instate.Stridesand multiplies byElementSizes[op]at use. This matches NumSharp'sShape.stridesconvention. - No Python object support.
REFS_OK, garbage collection hooks, andNDIter_GetBufferNeedsAPIare no-ops. All cast routines are written assuming the data is plain unmanaged bytes. - Int64 indexing. Every iteration counter is
long. Arrays > 2 GB are first-class, unlike NumPy which still usesnpy_intp(platform-dependent). - NumSharp-only iterator flag.
PARALLEL_SAFEmarks an iteration range as splittable across workers with no write hazard — set when there is noREDUCEoperand and at most one write operand whose overlap was resolved byCOPY_IF_OVERLAP. It has no NumPy equivalent.
Iterator State
A couple of fields deserve a closer look because every later section refers to them.
Shape, Coords, Strides
public long* Shape; // [NDim] — post-coalesce dimension sizes
public long* Coords; // [NDim] — current position, 0..Shape[d]
public long* Strides; // [NOp * NDim] — element stride per (op, axis)
public sbyte* Perm; // [NDim] — Perm[internal] = original_axis
// negative means axis was flipped
After coalescing, NDim can shrink. StridesNDim captures the stride allocation width so GetStride(axis, op) = Strides[op * StridesNDim + axis] still works.
Perm[internal_axis] = original_axis records how internal axes relate to the axes the caller passed in. If FlipNegativeStrides rewrote an axis, Perm[d] = -1 - original_axis encodes the flip. GetMultiIndex uses Perm to translate internal coords back into caller-space.
DataPtrs vs ResetDataPtrs vs BaseOffsets
public long* ResetDataPtrs; // base pointer per operand; start of iteration
public long* BaseOffsets; // byte accumulator from FlipNegativeStrides
public long* DataPtrs; // live pointer; moves every Advance()
Reset() copies ResetDataPtrs into DataPtrs. When the iterator flips an axis it walks the data pointer to the end-of-axis first (since we'll iterate backwards in original memory, forwards in flipped-coord space) and records the byte delta in BaseOffsets. ResetBasePointers(newPtrs) lets the caller swap the array out while keeping the iteration schedule: new reset = new base + stored offset.
Buffering Fields
public long BufferSize; // elements per operand buffer (default 8192)
public long BufIterEnd; // how far into the buffer we're iterating
public long* Buffers; // aligned-64 buffer pointer per operand (0 = no buffer)
public long* BufStrides; // inner-loop stride per operand in bytes
// == ElementSizes[op] for buffered operands
When buffering is active, an operand's DataPtrs[op] points into Buffers[op], not into the original NDArray. The kernel sees a contiguous buffer at the buffer dtype; NDIterBufferManager handles the strided copy-in and copy-out.
Reduction Fields (double-loop)
public int OuterDim; // which internal axis is the reduce axis
public long CoreSize; // elements per output slot (inner-loop length)
public long CorePos; // position within core, 0..CoreSize
public long ReduceOuterSize; // number of output slots in current buffer
public long ReducePos; // position within outer loop
public long* ReduceOuterStrides; // stride per op, advances to next output slot
public long* ReduceOuterPtrs; // saved pointer at start of current output slot
public long* ArrayWritebackPtrs; // array-space pointer for flushing output buffer
These only come into play when the iterator has both BUFFER and REDUCE flags. They're explained in detail in Buffered Reduction: The Double Loop.
Construction
Creating an iterator looks like this:
using var iter = NDIterRef.MultiNew(
nop: 3,
op: new[] { a, b, out },
flags: NDIterGlobalFlags.EXTERNAL_LOOP | NDIterGlobalFlags.BUFFERED,
order: NPY_ORDER.NPY_KEEPORDER,
casting: NPY_CASTING.NPY_SAFE_CASTING,
opFlags: new[] {
NDIterPerOpFlags.READONLY,
NDIterPerOpFlags.READONLY,
NDIterPerOpFlags.WRITEONLY },
opDtypes: new[] { NPTypeCode.Double, NPTypeCode.Double, NPTypeCode.Double });
Behind the scenes:
1. Pre-check WRITEMASKED/ARRAYMASK pairing (state-free validation)
2. Resolve broadcast shape (ResolveReturnShape; respects op_axes)
3. Allocate ALLOCATE operands with result dtype
4. state.AllocateDimArrays(ndim, nop) (one big NativeMemory.AllocZeroed)
5. Set MaskOp from ARRAYMASK flag
6. Find common dtype if COMMON_DTYPE
7. For each operand:
- SetOpSrcDType (array dtype)
- SetOpDType (buffer dtype; equals array dtype when not casting)
- Translate NDIterPerOpFlags → NDIterOpFlags
- Mark CAST if dtypes differ
- Compute strides (respecting op_axes or broadcast)
- Set data pointer = arr.Address + offset * elemSize
- Mark SourceBroadcast if any dim has stride 0 with Shape > 1
8. Validate casting requires BUFFERED flag
9. NDIterCasting.ValidateCasts(ref state, casting)
10. Apply op_axes reduction flags (detects implicit + explicit reduction axes)
11. FlipNegativeStrides (K-order only; skipped for C/F/A)
12. If NDim > 1: ReorderAxesForCoalescing → CoalesceAxes
(but only when MULTI_INDEX and C_INDEX/F_INDEX are both off)
13. Set EXLOOP, GROWINNER, HASMULTIINDEX, HASINDEX flags per request
14. InitializeFlatIndex() if HASINDEX
15. UpdateInnerStrides() (cache inner stride per op for fast access)
16. UpdateContiguityFlags() (sets CONTIGUOUS if every operand is contiguous)
17. If BUFFERED: allocate buffers, prime them with CopyToBuffer
18. If BUFFERED + REDUCE: SetupBufferedReduction (double-loop)
19. If IterSize <= 1: set ONEITERATION
The result is a state machine ready to produce pointers.
The Flag Families
There are four mostly-disjoint flag enums. A quick reference:
NDIterGlobalFlags — passed at construction, affect the whole iterator.
| Flag | Meaning |
|---|---|
C_INDEX, F_INDEX |
Track a flat index in C or F order |
MULTI_INDEX |
Track per-dim coords (needed for GetMultiIndex) |
EXTERNAL_LOOP |
Caller handles inner dim — iterator returns inner-dim-sized chunks |
COMMON_DTYPE |
Find common dtype across all operands and cast to it |
REDUCE_OK |
Allow reduction operands (needed for axis reductions) |
BUFFERED |
Enable operand buffering (required with cross-type casting) |
GROWINNER |
Make inner loop as large as possible within buffer |
DELAY_BUFALLOC |
Defer buffer alloc until first Reset |
DONT_NEGATE_STRIDES |
Suppress FlipNegativeStrides |
COPY_IF_OVERLAP |
Copy operand if it overlaps another in memory |
RANGED |
Iterator covers a sub-range |
NDIterPerOpFlags — passed per operand, affect just that operand.
| Flag | Meaning |
|---|---|
READONLY, WRITEONLY, READWRITE |
Direction |
COPY, UPDATEIFCOPY |
Force copy / update on dealloc |
ALLOCATE |
op[i] is null — iterator allocates using opDtypes[i] |
CONTIG |
Require contiguous view (may force buffering) |
NO_BROADCAST |
Error if this operand would need to broadcast |
WRITEMASKED, ARRAYMASK |
Writemask pair for masked writes |
NDIterFlags — internal state, set/cleared during iteration. (IDENTPERM, NEGPERM, HASINDEX, BUFFER, REDUCE, ONEITERATION, etc.) These flow from construction decisions.
NDIterOpFlags — per-operand internal state. (READ, WRITE, CAST, REDUCE, VIRTUAL, WRITEMASKED, BUF_REUSABLE, CONTIG.)
Coalescing, Reordering, and Flipping
The single biggest performance lever the iterator has is reducing NDim. A 3-D contiguous array should iterate in one flat loop, not in three nested ones.
Coalescing Rule
Two adjacent axes d and d+1 can merge if, for every operand:
stride[op][d] * shape[d] == stride[op][d+1]
...or either axis is size 1 with stride 0 (broadcast pass-through). When that holds, the pair is collapsed: the new shape is shape[d] * shape[d+1], the new stride is stride[op][d] (the inner one).
A contiguous 2×3×4 float32 array has strides [12, 4, 1] in elements. The coalescing check succeeds at both boundaries, and CoalesceAxes reduces NDim from 3 to 1 with shape 24 and stride 1. One flat SIMD loop, exactly.
Reordering
Coalescing only works if adjacent axes are already stride-ordered. ReorderAxesForCoalescing sorts axes by minimum absolute stride (smallest innermost) when the requested order allows it:
C-order: last axis innermost (no reorder — identity perm)
F-order: first axis innermost (reverse axes)
K-order: smallest stride innermost (insertion sort by stride)
A-order: behaves like K-order
For K-order on a non-contiguous broadcast array, stride-based sorting produces the wrong iteration order, so the iterator falls back to C-order. This guard rail lives in the construction logic around effectiveOrder.
Negative-Stride Flipping
FlipNegativeStrides only runs under K-order (not C/F/A — those are "forced orders" that preserve logical iteration direction). For each axis where all operands have zero or negative strides, the iterator:
- Negates the stride.
- Accumulates
(shape[d] - 1) * old_stride * elem_sizeintoBaseOffsets[op]. - Marks the axis flipped via
Perm[d] = (sbyte)(-1 - Perm[d]).
The effect: a reversed slice still iterates contiguous memory in ascending order, which the SIMD kernels can chew on. Later, GetMultiIndex decodes the flip so the caller sees original coordinates.
Interaction with MULTI_INDEX and HASINDEX
If MULTI_INDEX is set we reorder but don't coalesce — coalescing would lose the mapping from internal to original axes. Same for C_INDEX/F_INDEX, which need original axis structure to compute the flat index.
Memory Overlap and COPY_IF_OVERLAP
When an output operand shares memory with an input — np.add(a, a, out=a), b[1:] += b[:-1], a transposed view written back over its source — a naive iterator can clobber input elements before they're read, producing undefined results. NumPy guards this with NPY_ITER_COPY_IF_OVERLAP; NumSharp ports the guard, solver and all.
The flag's contract
Pass NDIterGlobalFlags.COPY_IF_OVERLAP and, for each write operand, the constructor checks it against every read operand. If they may share memory, the write operand is replaced with a fresh C-order temporary; the kernel writes into the temp, and the temp is copied back over the original when the iterator is disposed. Read operands are never copied — only the writers that would corrupt a reader. This runs in Initialize before the ALLOCATE step, so an iterator-allocated output can never be made to overlap an input.
The overlap solver (NDMemOverlap.cs)
"May these two views share memory?" is not the bounding-box question it looks like. a[::2] and a[1::2] occupy the same address range yet never touch the same byte. The exact test is a bounded Diophantine equation — the two arrays overlap iff
Σ strideₐ·xₐ − Σ strideᵦ·xᵦ = baseᵦ − baseₐ (0 ≤ x[i] < shape[i])
has an integer solution. NDMemOverlap is a faithful port of NumPy's numpy/_core/src/common/mem_overlap.c (Pauli Virtanen's solver), with System.Int128 standing in for NumPy's npy_extint128.h:
- Bounds quick-check.
GetMemoryExtentscomputes each array's half-open[start, end)byte range. Disjoint ranges →Nooverlap, no further work (this is NumPy'sMAY_SHARE_BOUNDS, i.e.maxWork == 0). - Diophantine solve. Overlapping bounds fall through to
SolveDiophantine(an extended-Euclid GCD chain plus a depth-first bounded search), which decides whether a real shared element exists. The search is work-capped bymaxWork: if the cap is hit the result isTooHard, which callers conservatively treat as "may share."
SolveMayShareMemory(a, b, maxWork) answers the two-array question; SolveMayHaveInternalOverlap(a, maxWork) answers "does one array alias itself?" (a stride-0 dim over a non-unit axis). The result enum is { No, Yes, TooHard, Overflow } — anything but No means "copy to be safe."
Resolution and writeback
ComputeCopyIfOverlap runs the write-vs-read matrix with maxWork = 1 — exactly NumPy's budget. Operands flagged for copy go through MaterializeForcedCopies: each is swapped for a C-order temp (copied in first if it's also read), tagged FORCECOPY | HAS_WRITEBACK, and the original recorded in _writebackOriginals. On Dispose(), ResolveWritebacks copies each temp back over its original — but only after FlushBufferWindow, so buffered writes reach the temp before the temp lands in the user's array. Get that ordering wrong and a buffered in-place op silently drops its last window.
The elementwise short-circuit
Forcing a copy for every in-place out= would be wasteful — np.add(a, a, out=a) is perfectly safe because the inner loop reads each element before it writes the same slot. The per-operand flag OVERLAP_ASSUME_ELEMENTWISE encodes that promise: when both operands carry it, are the same dtype, and have identical base/shape/strides with no internal self-overlap, AssumeElementwiseExactAlias skips the copy. Every ufunc out= path sets this flag, so the common in-place case pays nothing.
This is also the link to PARALLEL_SAFE: the same forced-copy machinery that makes a single in-place writer correct is what certifies that writer for parallel range-splitting (see Divergences from NumPy).
Iteration Mechanics
GetIterNext() returns the advance function for the current flag set. For unbuffered iteration there are three:
| Flavor | Picked when | Behavior |
|---|---|---|
SingleIterationNext |
ONEITERATION |
One shot, done |
ExternalLoopNext |
EXLOOP |
Advance outer coords only; inner dim is the caller's problem |
StandardNext |
otherwise | Full ripple-carry advance, one element at a time |
Buffered iteration adds two more: BufferedNext (windowed, with EXTERNAL_LOOP) jumps a whole buffer fill per call, and BufferedElementNext (without EXTERNAL_LOOP) walks the buffer one element at a time, refilling at the window edge. A buffered reduction uses the double-loop advance instead — see Buffering and Buffered Reduction.
state.Advance() is the ripple-carry primitive. For each axis from innermost to outermost:
for axis in (NDim-1 ... 0):
coord[axis]++
if coord[axis] < shape[axis]:
dataptrs[op] += stride[op][axis] * elem_size[op] for every op
return
// carry: reset this axis
coord[axis] = 0
dataptrs[op] -= stride[op][axis] * (shape[axis] - 1) * elem_size[op]
// fell through: iteration complete
Straightforward, but note the rewind on carry: when axis 2 wraps, we subtract stride*(shape-1)*size so the pointer lands back at the axis-2 start, then axis 1 will add one stride. The net effect is identical to dataptr = base + sum(coord[d] * stride[d][op]) * size, but computed incrementally.
GetInnerLoopSizePtr()
Ideally the inner loop processes many elements per iternext call. The iterator exposes this via:
long* size = iter.GetInnerLoopSizePtr();
- Windowed buffered (
BUFFER, noREDUCE): returns&state.BufTransferSize— the element count of the current buffer fill (NumPy'sNBF_SIZE). - Buffered reduction (
BUFFER + REDUCE): returns&state.BufIterEnd, which the double-loop uses as the inner-loop count. - Otherwise: returns
&state.Shape[NDim-1](the innermost dimension size).
With EXTERNAL_LOOP set and the array coalesced to 1-D, one iternext call returns the entire array size — a single kernel invocation processes everything.
Buffering
Buffering solves two problems:
- Casting. If the caller wants to see doubles but the NDArray is int32, the iterator copies into a double buffer, runs the kernel against the buffer, writes back on dispose.
- Non-contiguous + SIMD. If the operand is strided (sliced, transposed), copying to a contiguous buffer lets a SIMD kernel work efficiently.
NDIterBufferManager.AllocateBuffers allocates 64-byte-aligned blocks (AVX-512-friendly) per operand that needs buffering. Default buffer size is 8192 elements; this can be tuned per call.
strided array (stride=5, size=24) aligned 64-byte buffer (size ≤ 8192)
┌─────┬─────┬─────┬─────┐ ┌──┬──┬──┬──┬──┬──┬──┐
│ a[0]│ ? │ ? │ ? │ CopyToBuffer │a0│a5│a10│... │
│ ? │ ? │ ? │ a[5]│ ────────▶ └──┴──┴──┴──┴──┴──┴──┘
│ ? │ ? │a[10]│ ? │ ^
│ ... │ DataPtrs[op] points here
└─────────────────────┘ BufStrides[op] = sizeof(T)
Once the buffer is filled, DataPtrs[op] moves into the buffer and every inner-loop kernel treats it as a flat contiguous array. When iteration advances past BufIterEnd, NDIterBufferManager.CopyFromBuffer writes output back into the original array (respecting original strides) and CopyToBuffer refills input buffers for the next chunk.
GROWINNER
GROWINNER requests that the inner loop grow to consume as much of the buffer as possible per kernel call — more work per invocation, less loop overhead. In NumSharp the inner-loop length is set by axis coalescing together with the buffer-window size: a contiguous array coalesces to a single axis (a 5×6 contiguous array becomes one length-30 axis), so the inner loop already spans the whole contiguous run, and the buffer window is sized identically with or without the flag. The flag is carried for NumPy API parity.
BUF_REUSABLE
For reductions, the same input block may be read multiple times (e.g. mean when accumulator type differs). The BUF_REUSABLE flag tells the iterator "the buffer contents are still valid, skip the copy." CopyToBufferIfNeeded honors it.
Buffered Reduction: The Double Loop
When you do np.sum(a, axis=0) on a 2-D array, the output has one fewer axis than the input. The iterator must visit every input but accumulate into a fixed output position while the reduction axis is scanned. The efficient way to do this with buffering is NumPy's double loop:
CoreSize = length of reduce axis ("how many inputs per output")
ReduceOuterSize = other-axes length fitted into buffer ("how many output slots")
For each buffer fill:
for outer in 0..ReduceOuterSize: ← advance ReduceOuterPtrs by ReduceOuterStrides
for core in 0..CoreSize: ← advance DataPtrs by BufStrides
kernel(dataptrs, bufstrides, 1) ← accumulate into output
// reset inner, move outer pointer to next output slot
The trick: reduce operands have BufStrides[op] = 0, so inside the core loop their pointer stays pinned. The kernel keeps adding into the same output slot until the reduce axis is exhausted; the outer loop then moves to the next output slot.
NDIterState.BufferedReduceAdvance() returns:
1— more elements in current buffer (inner or outer)0— buffer exhausted, caller must refill-1— iteration complete, caller must flush
The bridge's BufferedReduce method drives this explicitly.
IsFirstVisit
Reduction kernels must initialize the output before accumulating. iter.IsFirstVisit(op) returns true only when every reduction-axis coordinate is zero and CorePos == 0 in buffered mode. Kernels check this once at each output slot to emit identity-write semantics:
if (iter.IsFirstVisit(reduceOp)) *(double*)ptrs[reduceOp] = 0.0;
*(double*)ptrs[reduceOp] += *(double*)ptrs[inputOp];
Kernel Integration Layer
Everything up to this point describes NDIter's scheduling machinery. What NDIter.Execution.cs adds is the connection between that schedule and the SIMD kernels ILKernelGenerator emits.
The layer is a partial declaration of NDIterRef that exposes seven entry points arranged along an ergonomics-vs-control axis. Pick the one that matches your use case; they all share the same compiled-kernel cache and all run through the same ForEach driver at the bottom.
ergonomics control
▲ ▲
│ │
Layer 3 │ ExecuteBinary / Unary / Reduction / Comparison / Scan │ 90% case
│ "one call, NumPy-style — one line per op" │
────────── │ ───────────────────────────────────────────────────────── │ ──────────
Tier 3C │ ExecuteExpression(NDExpr) │ compose
│ "build a tree with operators; no IL in caller" │ with DSL
────────── │ ───────────────────────────────────────────────────────── │ ──────────
Tier 3C+Call │ NDExpr.Call(Math.X / Func / MethodInfo, args) │ inject any
│ "invoke arbitrary managed method per element" │ BCL / user op
────────── │ ───────────────────────────────────────────────────────── │ ──────────
Tier 3B │ ExecuteElementWiseBinary(scalarBody, vectorBody) │ hand-tune
│ "write per-element IL; factory wraps the unroll shell" │ the vector body
────────── │ ───────────────────────────────────────────────────────── │ ──────────
Tier 3A │ ExecuteRawIL(emit, key, aux) │ emit
│ "emit the whole inner-loop body including ret" │ everything
────────── │ ───────────────────────────────────────────────────────── │ ──────────
Layer 2 │ ExecuteGeneric<TKernel> / ExecuteReducing<TKernel, TAccum> │ struct-
│ "zero-alloc; JIT specializes per struct; early-exit reduce" │ generic
────────── │ ───────────────────────────────────────────────────────── │ ──────────
Layer 1 │ ForEach(NDInnerLoopFunc kernel, void* aux) │ delegate,
│ "closest to NumPy's C API; closures welcome" │ anything goes
│ │
▼ ▼
NDIter state (Shape, Strides, DataPtrs, Buffers, ...)
│
▼
ILKernelGenerator (DynamicMethod + V128/V256/V512)
Quick reference
| # | Entry point | When to reach for it | Per-call cost |
|---|---|---|---|
| 1 | ExecuteBinary / Unary / Reduction / Comparison / Scan |
The op is a standard NumPy ufunc. 90% of cases. | Cache hit after first call |
| 2 | ExecuteExpression(NDExpr) |
Compose a fused ufunc from DSL nodes (Add, Sqrt, Where, Exp, comparisons, Min/Max/Clamp, …). SIMD when dtypes align. |
Cache hit after first compile |
| 3 | ExecuteExpression(NDExpr.Call(...)) |
DSL doesn't expose the op you want (Math.BitIncrement, custom activation, reflected plugin method). |
+5-10 ns / element for non-static delegates |
| 4 | ExecuteElementWiseBinary / Unary / Ternary / ExecuteElementWise (array form) |
You want SIMD + 4× unroll for a fused or non-standard op; the DSL doesn't compose to it, but the loop shape is still element-wise. Hand-write the scalar + vector body. | Cache hit after first compile |
| 5 | ExecuteRawIL(emit, key, aux) |
Non-rectangular loop: gather/scatter, cross-element deps, branch-on-auxdata. You emit every opcode. | Cache hit after first compile |
| 6 | ExecuteGeneric<TKernel> / ExecuteReducing<TKernel, TAccum> |
Custom kernel in struct form. Zero allocation; JIT specializes. Only path with early-exit reductions. | No delegate indirection |
| 7 | ForEach(NDInnerLoopFunc) |
Exploratory; one-off fused kernels; anything a closure makes natural. | Delegate allocation per call |
Decision tree
Is the op a standard NumPy ufunc already in ExecuteBinary/Unary/Reduction?
yes → Layer 3 (baked). Fastest, zero work. Done.
no ↓
Can I express it as a tree of DSL nodes (Add, Sqrt, Where, Exp, …)?
yes → Tier 3C. Fused, SIMD-or-scalar automatic, no IL.
no ↓
Is the missing piece a BCL method (Math.X, user activation, reflected plugin)?
yes → Tier 3C + Call. Scalar-only but fused. Done.
no ↓
Do I need V256/V512 intrinsics the DSL doesn't wrap (Fma, Shuffle, Gather, …)?
yes → Tier 3B. Hand-write the vector body; factory wraps the shell.
no ↓
Is the loop shape non-rectangular (gather/scatter, cross-element deps)?
yes → Tier 3A. Emit the whole inner-loop IL yourself.
no ↓
Do I need an early-exit reduction (Any / All / find-first)?
yes → Layer 2 ExecuteReducing. Returns false from the kernel to bail out.
no ↓
Just exploring or writing a one-off?
→ Layer 1 ForEach. Delegate per call; flexible.
Measured behavior
Benchmarked on 1M-element arrays, post-warmup, via the showcase script in this doc's /demos/ sibling (not checked in — recreate with the snippet in each tier's section below):
| Technique | Operation | Time / run | Notes |
|---|---|---|---|
| Layer 3 | a + b (f32) |
0.58 ms | baked, 4×-unrolled V256, cache hit |
| Tier 3B | 2a + 3b hand V256 (f32) |
0.61 ms | within ~7% of baked — same shell |
| Layer 2 reduction | AnyNonZero early-exit (hit @ 500) |
0.001 ms | returns false from kernel, bridge bails |
| Tier 3A | abs(a - b) raw IL (i32) |
1.27 ms | scalar loop, JIT autovectorizes post tier-1 |
| Call | GELU via captured lambda (f64) |
8.08 ms | Math.Tanh dominates |
| Tier 3C | stable sigmoid via Where (f64) |
13.6 ms | 3 × Math.Exp per element |
Layer 1 and Layer 2 element-wise kernels have a tier-0 JIT characteristic: when run from a dynamic host (ephemeral script, dotnet_run, first-call cold start) they can measure 30-50× slower than production code. Post-tier-1 promotion (~100 hot-loop iterations) brings them within 2-3 ms for hypot on 1M f32. See JIT warmup and measurement.
Cache state — two lifetimes to know about
The full integration layer shares two process-lifetime caches. The kernel-cache counts are public read-only observability on GeneratedDelegates; resetting a cache is an internal test-only hook (needs [InternalsVisibleTo] or the AssemblyName=NumSharp.DotNetRunScript script directive):
int kernels = GeneratedDelegates.InnerLoopCount; // compiled DynamicMethods (public)
int slots = DelegateSlots.RegisteredCount; // registered delegates + targets (public)
GeneratedDelegates.ClearInnerLoop(); // internal, test-only
DelegateSlots.Clear(); // test-only — pair with above!
After running the full showcase (Layer 3 + Tiers A-C + Call across 130 warmup+timed iterations), typical counts are:
GeneratedDelegates.InnerLoopCount = 4 ← one per unique cache key across all tiers
DelegateSlots.RegisteredCount = 131 ← one per Call(lambda) construction
The 131 reflects the behavior described in the Memory model and lifetime section — every NDExpr.Call(lambda, …) constructor call re-registers the delegate, even if the kernel is reused via an explicit cacheKey. To keep slot growth flat, register delegates once at startup (static readonly Func<…>); see the registration-once pattern.
Layer 1 — Canonical Inner-Loop API
This is the NumPy-in-C pattern. You hand the iterator a function pointer (a delegate in C#), and it runs the canonical loop:
public void ForEach(NDInnerLoopFunc kernel, void* auxdata = null);
public unsafe delegate void NDInnerLoopFunc(
void** dataptrs, long* strides, long count, void* auxdata);
One call per inner loop, not per element. The iterator decides what "inner loop" means:
| Scenario | Call count | Count per call |
|---|---|---|
Fully coalesced + contiguous, with EXTERNAL_LOOP |
1 | IterSize |
Non-coalesced with EXTERNAL_LOOP |
outer product | Shape[NDim-1] |
| Buffered | ceil(IterSize / BufferSize) |
BufIterEnd |
Neither EXTERNAL_LOOP nor BUFFERED |
IterSize |
1 |
The strides passed to the kernel are always in bytes — the bridge converts from element strides for the non-buffered path. This matches NumPy's convention and makes the kernel body identical whether or not the iterator is buffering.
Performance note. Post-tier-1 the JIT autovectorizes both byte-pointer and typed-pointer loops into Vector256. To get there faster and to keep the fast path as simple as possible, branch on stride at the top and drop to typed pointers:
using var iter = NDIterRef.MultiNew(3, new[] { a, b, c },
NDIterGlobalFlags.EXTERNAL_LOOP,
NPY_ORDER.NPY_KEEPORDER,
NPY_CASTING.NPY_NO_CASTING,
new[] { NDIterPerOpFlags.READONLY,
NDIterPerOpFlags.READONLY,
NDIterPerOpFlags.WRITEONLY });
iter.ForEach((ptrs, strides, count, _) => {
// Fast branch: contiguous, element stride == sizeof(float).
// The JIT autovectorizes this to Vector256 sqrt.
if (strides[0] == 4 && strides[1] == 4 && strides[2] == 4) {
float* a = (float*)ptrs[0], b = (float*)ptrs[1], c = (float*)ptrs[2];
for (long i = 0; i < count; i++)
c[i] = MathF.Sqrt(a[i] * a[i] + b[i] * b[i]);
return;
}
// Slow branch: strided / broadcast. Correct but scalar.
long sA = strides[0], sB = strides[1], sC = strides[2];
byte* pA = (byte*)ptrs[0]; byte* pB = (byte*)ptrs[1]; byte* pC = (byte*)ptrs[2];
for (long i = 0; i < count; i++) {
float av = *(float*)(pA + i * sA);
float bv = *(float*)(pB + i * sB);
*(float*)(pC + i * sC) = MathF.Sqrt(av * av + bv * bv);
}
});
Use this when you're writing a one-off operation that doesn't fit the standard ufunc shape, or when you want to fuse several operations into a single pass to avoid temporaries.
Layer 2 — Struct-Generic Dispatch
Delegates have an indirect call. For hot inner loops, that hurts. Layer 2 trades a delegate for a struct type parameter:
public interface INDInnerLoop
{
void Execute(void** dataptrs, long* strides, long count);
}
public interface INDReducingInnerLoop<TAccum> where TAccum : unmanaged
{
bool Execute(void** dataptrs, long* strides, long count, ref TAccum accumulator);
}
public void ExecuteGeneric<TKernel>(TKernel kernel)
where TKernel : struct, INDInnerLoop;
public TAccum ExecuteReducing<TKernel, TAccum>(TKernel kernel, TAccum init)
where TKernel : struct, INDReducingInnerLoop<TAccum>
where TAccum : unmanaged;
Because TKernel is constrained to struct, the JIT specializes one copy of ExecuteGeneric per struct type at codegen time and inlines kernel.Execute(...) at the call site. No vtable, no delegate, no boxing. It's the closest managed C# gets to C++ templates.
The bridge splits ExecuteGeneric internally so the single-inner-loop case (the common case: coalesced contig + EXTERNAL_LOOP, ONEITERATION, or buffered-fits-in-one-fill) goes through ExecuteGenericSingle — a tiny [AggressiveInlining] method with one kernel.Execute call and no do/while. That's what lets the JIT autovectorize the kernel's body. The multi-loop path keeps the canonical do { kernel.Execute(...); } while (iternext); driver.
readonly unsafe struct HypotKernel : INDInnerLoop
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void Execute(void** p, long* s, long n)
{
// Fast branch — typed pointers so the JIT autovectorizes.
if (s[0] == 4 && s[1] == 4 && s[2] == 4) {
float* a = (float*)p[0], b = (float*)p[1], c = (float*)p[2];
for (long i = 0; i < n; i++)
c[i] = MathF.Sqrt(a[i] * a[i] + b[i] * b[i]);
return;
}
// Slow branch — any stride, scalar.
long sA = s[0], sB = s[1], sC = s[2];
byte* pA = (byte*)p[0]; byte* pB = (byte*)p[1]; byte* pC = (byte*)p[2];
for (long i = 0; i < n; i++) {
float av = *(float*)(pA + i * sA);
float bv = *(float*)(pB + i * sB);
*(float*)(pC + i * sC) = MathF.Sqrt(av * av + bv * bv);
}
}
}
iter.ExecuteGeneric(default(HypotKernel)); // zero-alloc, inlined
For early-exit reductions, the kernel returns false to abort:
readonly unsafe struct AnyNonZero : INDReducingInnerLoop<bool>
{
public bool Execute(void** p, long* s, long n, ref bool acc)
{
long st = s[0]; byte* pt = (byte*)p[0];
for (long i = 0; i < n; i++)
if (*(int*)(pt + i * st) != 0) { acc = true; return false; } // stop
return true;
}
}
bool found = iter.ExecuteReducing<AnyNonZero, bool>(default, false);
On a 1M-element array with a non-zero near the start, this returns after one kernel call.
Layer 3 — Typed ufunc Dispatch
Layer 3 is what you reach for 90% of the time: "run a standard ufunc, pick the best kernel." The bridge inspects the iterator's post-coalesce stride picture, constructs the right cache key for ILKernelGenerator, materializes a SIMD kernel, and invokes it.
public void ExecuteBinary(BinaryOp op); // [in0, in1, out]
public void ExecuteUnary(UnaryOp op); // [in, out]
public void ExecuteComparison(ComparisonOp); // [in0, in1, bool out]
public TResult ExecuteReduction<TResult>(ReductionOp op); // [in] → T
public void ExecuteScan(ReductionOp op); // [in, out]
public void ExecuteCopy(); // [src, dst]
public void BufferedReduce<K, T>(K kernel); // explicit BUFFER+REDUCE double-loop
Under the hood each helper does four things:
- Validate. Throw if operand count or flags are wrong.
- Detect path. Scan operand strides, pick
SimdFull/SimdScalarRight/SimdScalarLeft/SimdChunk/General. - Prepare args.
stackallocone stride array per operand, fill with element strides, grab_state->Shapeand data pointers. - Invoke.
ILKernelGenerator.GetMixedTypeKernel(key)(...)— cache hit returns the cached delegate, cache miss emits IL and caches.
For buffered paths, ExecuteBinary dispatches to RunBufferedBinary, which runs the kernel against _state->Buffers using BufStrides (always element-sized for the buffer dtype) rather than the original-array strides — the inner loop reads contiguous buffer memory while NDIterBufferManager handles the strided copy-in and copy-out.
Custom Operations (Tier 3A / 3B / 3C)
The enum-driven Execute{Binary,Unary,Reduction,...} methods cover every primitive NumPy ufunc, but they're a closed set. The moment you want a*b + c as one pass, or sqrt(a² + b²) without materializing intermediates, or a brand-new op that isn't in BinaryOp/UnaryOp, you're outside the baked catalog.
The Custom Operations extension solves this by letting the bridge IL-generate a kernel specialized for any user-defined computation while preserving Layer 3's 4×-unrolled SIMD shell. Three tiers trade control for convenience:
┌─────────────────── You provide ────────────────────┐
Tier 3A │ the entire inner-loop IL body │ Maximum control
Tier 3B │ per-element scalar + (optional) vector IL body │ Shared unroll shell
Tier 3C │ an expression tree (NDExpr) │ No IL required
└────────────────────────────────────────────────────┘
│
▼
ILKernelGenerator.CompileInnerLoop (new partial)
│
┌─────────┴─────────┐
▼ ▼
Contig SIMD path Scalar strided path
(4× unroll + V256 (per-element, stride-aware
+ 1-vec remainder pointer walk)
+ scalar tail)
└─────────┬─────────┘
▼
NDInnerLoopFunc delegate (cached)
│
▼
NDIterRef.ForEach → do { kernel(...); } while (iternext)
All three tiers produce the same delegate shape (NDInnerLoopFunc) and funnel through ForEach. The factory emits a runtime contig check at the top of the kernel: if every operand's byte stride equals its element size, take the SIMD path; otherwise fall into the scalar-strided loop. Cache keys are user-supplied strings; Tier 3C derives a structural signature automatically if you don't provide one.
Method on NDIterRef |
Tier | What you supply |
|---|---|---|
ExecuteRawIL(emit, key, aux) |
A | Action<ILGenerator> — the entire method, including ret |
ExecuteElementWise(operandTypes, scalarBody, vectorBody, key) |
B | Two Action<ILGenerator> — per-element scalar and vector |
ExecuteElementWiseUnary/Binary/Ternary(...) |
B | Typed convenience overloads |
ExecuteExpression(expr, inputTypes, outputType, key?) |
C | An NDExpr tree |
Tier 3A — Raw IL
You emit everything. Arguments are the canonical inner-loop shape: arg0 = void** dataptrs, arg1 = long* byteStrides, arg2 = long count, arg3 = void* auxdata. Your body must emit its own ret. Cached by the string key you pass — same key returns the same compiled delegate.
iter.ExecuteRawIL(il =>
{
// Pull out pointers and strides once.
var p0 = il.DeclareLocal(typeof(byte*));
var p1 = il.DeclareLocal(typeof(byte*));
var p2 = il.DeclareLocal(typeof(byte*));
// ... load dataptrs[0..2], strides[0..2] ...
// for (i = 0; i < count; i++) *p2 = *p0 + *p1
var i = il.DeclareLocal(typeof(long));
il.Emit(OpCodes.Ldc_I8, 0L); il.Emit(OpCodes.Stloc, i);
var top = il.DefineLabel(); var end = il.DefineLabel();
il.MarkLabel(top);
il.Emit(OpCodes.Ldloc, i); il.Emit(OpCodes.Ldarg_2); il.Emit(OpCodes.Bge, end);
// compute p2[i*s2] = p0[i*s0] + p1[i*s1]
// ...
il.Emit(OpCodes.Ldloc, i); il.Emit(OpCodes.Ldc_I8, 1L); il.Emit(OpCodes.Add); il.Emit(OpCodes.Stloc, i);
il.Emit(OpCodes.Br, top);
il.MarkLabel(end);
il.Emit(OpCodes.Ret);
}, cacheKey: "my_int32_add");
Use when: you need a loop shape the templated shell can't express (gather, scatter, cross-element dependencies, non-rectangular write patterns).
Tier 3B — Templated Inner Loop
Supply only the per-element work; the factory wraps it in the standard 4×-unrolled SIMD + 1-vector remainder + scalar tail + scalar-strided fallback. The two Action<ILGenerator> callbacks are stack-based:
scalarBody— on entry, stack holds N input scalars in order (operand 0 deepest, operand N-1 on top); on exit, stack must hold one value of the output dtype.vectorBody— same contract but withVector{W}<T>values. Optional — passnullfor scalar-only. If non-null and all operand dtypes are identical and the type is SIMD-capable, the factory emits the fast path.
// out = a*b + 1 on 16 float32s, fused in one pass.
iter.ExecuteElementWiseBinary(
NPTypeCode.Single, NPTypeCode.Single, NPTypeCode.Single,
scalarBody: il =>
{
// Stack: [a, b] -> [a*b + 1]
il.Emit(OpCodes.Mul);
il.Emit(OpCodes.Ldc_R4, 1.0f);
il.Emit(OpCodes.Add);
},
vectorBody: il =>
{
// Stack: [va, vb] -> [va*vb + 1]
ILKernelGenerator.EmitVectorOperation(il, BinaryOp.Multiply, NPTypeCode.Single);
il.Emit(OpCodes.Ldc_R4, 1.0f);
ILKernelGenerator.EmitVectorCreate(il, NPTypeCode.Single);
ILKernelGenerator.EmitVectorOperation(il, BinaryOp.Add, NPTypeCode.Single);
},
cacheKey: "fma_f32_c1");
The ILKernelGenerator.Emit* helpers (EmitVectorOperation, EmitVectorCreate, EmitVectorLoad, EmitVectorStore, EmitScalarOperation, EmitConvertTo, EmitLoadIndirect, EmitStoreIndirect, EmitUnaryScalarOperation, EmitUnaryVectorOperation) are exposed as internal so you can compose primitives without reinventing IL emission. The same helpers power the baked ExecuteBinary/ExecuteUnary kernels.
Convenience overloads exist for common arities:
iter.ExecuteElementWiseUnary(inType, outType, scalarBody, vectorBody, key);
iter.ExecuteElementWiseBinary(lhs, rhs, outType, scalarBody, vectorBody, key);
iter.ExecuteElementWiseTernary(a, b, c, outType, scalarBody, vectorBody, key);
For arity > 3 or variable operand counts, use the array form ExecuteElementWise(NPTypeCode[], ...).
When SIMD is skipped. The factory emits the vector path only when CanSimdAllOperands(operandTypes) returns true — every operand's dtype must be identical and SIMD-capable (i.e. not Boolean, Char, or Decimal). If either condition fails, only the scalar path is emitted. Mixed-type ufuncs (e.g. int32 + float32 → float32) use the scalar path with the user's EmitConvertTo inside the body.
Contig runtime check. The kernel's first act is to compare each operand's stride with its element size. If any differ, control jumps to the scalar-strided loop — inner-axis iteration that advances pointers by their declared byte strides. This means a single kernel handles both contiguous and sliced inputs without recompiling.
Use when: you want SIMD + 4× unrolling for a fused or non-standard op but don't want to hand-roll the whole loop.
Tier 3C — Expression DSL
The expression DSL lets you compose ops with C# operator syntax, and Compile() emits the IL for you. No ILGenerator exposure in your code.
// out = sqrt(a² + b²)
var expr = NDExpr.Sqrt(NDExpr.Square(NDExpr.Input(0)) +
NDExpr.Square(NDExpr.Input(1)));
iter.ExecuteExpression(expr,
inputTypes: new[] { NPTypeCode.Single, NPTypeCode.Single },
outputType: NPTypeCode.Single);
Node catalog
Leaves.
| Factory | Semantics | NumPy |
|---|---|---|
NDExpr.Input(i) |
Reference operand i (0-based input index). Auto-converts to output dtype on load. |
— |
NDExpr.Const(value) |
Literal — int / long / float / double overloads. Emitted at the output dtype. |
— |
Binary arithmetic.
| Factory | Operator | SIMD | NumPy equivalent | Notes |
|---|---|---|---|---|
Add(a,b) |
a + b |
✓ | np.add |
|
Subtract(a,b) |
a - b |
✓ | np.subtract |
|
Multiply(a,b) |
a * b |
✓ | np.multiply |
|
Divide(a,b) |
a / b |
✓ | np.divide |
True-division for floats; integer division for ints. |
Mod(a,b) |
a % b |
— | np.mod |
Floored modulo — result sign follows divisor (like Python %), unlike C# % which truncates toward zero. |
Power(a,b) |
— | — | np.power |
Routed through Math.Pow(double, double); integer operands are promoted to double and the result converted back. |
FloorDivide(a,b) |
— | — | np.floor_divide |
Floor toward negative infinity. For signed int operands, correctly returns -4 (not -3) for -10 // 3. |
ATan2(y,x) |
— | — | np.arctan2 |
Four-quadrant arctan via Math.Atan2. |
Binary bitwise. Integer types only; floating-point operands are a compile-time IL emission error.
| Factory | Operator | SIMD | NumPy equivalent |
|---|---|---|---|
BitwiseAnd(a,b) |
a & b |
✓ | np.bitwise_and |
BitwiseOr(a,b) |
a \| b |
✓ | np.bitwise_or |
BitwiseXor(a,b) |
a ^ b |
✓ | np.bitwise_xor |
Scalar-branchy combinators (scalar path only).
| Factory | Semantics | NumPy equivalent |
|---|---|---|
Min(a,b) |
Delegates to Math.Min — NaN-propagating per IEEE 754. |
np.minimum (not np.fmin) |
Max(a,b) |
Delegates to Math.Max — NaN-propagating per IEEE 754. |
np.maximum (not np.fmax) |
Clamp(x,lo,hi) |
Min(Max(x,lo),hi) — sugar, shares the compiled kernel structure with the underlying pair. |
np.clip |
Where(cond,a,b) |
Branchy ternary select: if cond != 0 return a else b. cond is evaluated in the output dtype, so floats, integers, and decimals all work uniformly. |
np.where (with eager eval of both branches) |
Where's branches are both emitted into the kernel but only the taken one runs per element — thebrfalsebranches past the untaken side. If one side is much more expensive (e.g.Exp), the cost is only paid on elements where it's selected, makingWherea real optimization overcond * a + (1-cond) * bfor expensive alternatives.
Unary — arithmetic.
| Factory | Operator | SIMD | NumPy equivalent |
|---|---|---|---|
Negate(x) |
unary -x |
✓ | np.negative |
Abs(x) |
— | ✓ | np.abs / np.absolute |
Sqrt(x) |
— | ✓ | np.sqrt |
Square(x) |
— | ✓ | np.square |
Reciprocal(x) |
— | ✓ | np.reciprocal |
Cbrt(x) |
— | — | np.cbrt |
Sign(x) |
— | — | np.sign |
Unary — exp / log. All route through Math.Exp / Log / ... (or MathF for Single); integer inputs are auto-promoted to double around the call and cast back at the end.
| Factory | Semantics | SIMD | NumPy equivalent |
|---|---|---|---|
Exp(x) |
eˣ | — | np.exp |
Exp2(x) |
2ˣ | — | np.exp2 |
Expm1(x) |
eˣ − 1 (accurate for small x) | — | np.expm1 |
Log(x) |
ln x | — | np.log |
Log2(x) |
log₂ x | — | np.log2 |
Log10(x) |
log₁₀ x | — | np.log10 |
Log1p(x) |
ln(1 + x) (accurate for small x) | — | np.log1p |
Unary — trigonometric.
| Factory | Semantics | SIMD | NumPy equivalent |
|---|---|---|---|
Sin(x), Cos(x), Tan(x) |
Standard trig | — | np.sin / cos / tan |
Sinh(x), Cosh(x), Tanh(x) |
Hyperbolic | — | np.sinh / cosh / tanh |
ASin(x), ACos(x), ATan(x) |
Inverse | — | np.arcsin / arccos / arctan |
Deg2Rad(x) |
x · π/180 | ✓ | np.deg2rad / np.radians |
Rad2Deg(x) |
x · 180/π | ✓ | np.rad2deg / np.degrees |
Unary — rounding.
| Factory | Semantics | SIMD | NumPy equivalent |
|---|---|---|---|
Floor(x) |
⌊x⌋ | ✓ | np.floor |
Ceil(x) |
⌈x⌉ | ✓ | np.ceil |
Round(x) |
Banker's rounding (half-to-even) | — | np.rint (matches NumPy's half-to-even default) |
Truncate(x) |
Toward zero | — | np.trunc |
RoundandTruncatehave a working SIMD path on .NET 9+, but NumSharp's library targets .NET 8 as well, whereVector256.Round/Truncatedon't exist. NDExpr gates them to the scalar path unconditionally so the compiled kernel works on both frameworks. Other contiguous rounding ops autovectorize after tier-1 JIT promotion.
Unary — bitwise / logical / predicates.
| Factory | Operator | SIMD | NumPy equivalent | Notes |
|---|---|---|---|---|
BitwiseNot(x) |
~x |
✓ | np.invert / np.bitwise_not |
Integer types only. |
LogicalNot(x) |
!x |
— | np.logical_not |
Returns 1 if x == 0 else 0. Routes through EmitComparisonOperation(Equal, outType) — correct for all dtypes including Int64, Single, Double, Decimal (see Behavioral notes). |
IsNaN(x) |
— | — | np.isnan |
Returns 0/1 at output dtype. For integer types: always 0. |
IsFinite(x) |
— | — | np.isfinite |
Returns 0/1 at output dtype. For integer types: always 1. |
IsInf(x) |
— | — | np.isinf |
Returns 0/1 at output dtype. For integer types: always 0. |
Comparisons (produce numeric 0 or 1 at output dtype; scalar path only).
| Factory | Semantics | NumPy equivalent |
|---|---|---|
Equal(a,b) |
a == b |
np.equal |
NotEqual(a,b) |
a != b |
np.not_equal |
Less(a,b) |
a < b |
np.less |
LessEqual(a,b) |
a <= b |
np.less_equal |
Greater(a,b) |
a > b |
np.greater |
GreaterEqual(a,b) |
a >= b |
np.greater_equal |
Unlike NumPy's comparison ufuncs (which return bool arrays), Tier 3C's single-output-dtype model collapses comparisons to 0 or 1 at the output dtype. This composes cleanly with arithmetic — e.g. ReLU becomes (x > 0) * x.
NaN semantics match IEEE 754: any comparison involving NaN produces 0 (false). NaN == NaN → 0, NaN < 5 → 0, NaN >= 5 → 0. To test for NaN, use IsNaN(x).
Call — invoke any .NET method. The escape hatch for math not in the node catalog. Scalar path only.
| Factory | Semantics |
|---|---|
Call<T1…Tn, TR>(Func<T1…Tn, TR> f, NDExpr a1, …) |
Typed generic overloads for arity 0–4. Accept method groups without cast (NDExpr.Call(Math.Sqrt, x), NDExpr.Call(Math.Pow, x, y)). |
Call(Delegate func, params NDExpr[] args) |
Catch-all for pre-constructed delegates. Use when the arity exceeds 4 or when the typed overload is ambiguous. |
Call(MethodInfo staticMethod, params NDExpr[] args) |
Invoke a reflection-obtained static method. |
Call(MethodInfo instanceMethod, object target, params NDExpr[] args) |
Invoke a reflection-obtained instance method against target. |
See Call — invoke any .NET method below for dispatch paths, auto-conversion rules, supported signatures, performance envelope, and overload-disambiguation guidance.
Operator overloads
An expression tree reads like ordinary C#:
// (a + b) * c + 1
var linear = (NDExpr.Input(0) + NDExpr.Input(1)) * NDExpr.Input(2) + NDExpr.Const(1.0f);
// ReLU via comparison × input
var relu = NDExpr.Greater(NDExpr.Input(0), NDExpr.Const(0.0f)) * NDExpr.Input(0);
// Clamp with no named method call
var clamped = NDExpr.Min(NDExpr.Max(NDExpr.Input(0), NDExpr.Const(0f)), NDExpr.Const(1f));
Overloads: + - * / (arithmetic), % (NumPy mod), & | ^ (bitwise), unary - (negate), ~ (bitwise not), ! (logical not). No overloads for <, >, ==, != (those need to return bool in C#, which would collide with object.Equals and similar) — use the factory methods (Less, Greater, Equal, NotEqual, LessEqual, GreaterEqual) for comparisons.
Call — invoke any .NET method
The DSL's built-in catalog covers most element-wise math. Call is the escape hatch for everything else: user-defined activations, BCL helpers without a dedicated node (e.g. Math.BitDecrement, Math.CopySign), plugin methods discovered through reflection, captured-state business logic. It trades SIMD for universality.
One node, four factory shapes, three dispatch paths. All four factories construct the same CallNode; the node inspects its input and picks the cheapest dispatch at construction:
┌─────────────────────────┐
NDExpr.Call(...) │ CallNode │
─────────────▶ │ _kind ∈ { │
│ StaticMethod, │ ← call <methodinfo>
│ BoundTarget, │ ← load target, callvirt
│ Delegate │ ← load delegate, Invoke
│ } │
└─────────────────────────┘
Path A — static methods (zero indirection).
// Func<T...> overload: compiler infers delegate signature, no cast needed
// for non-overloaded methods.
NDExpr.Call(Math.Sqrt, NDExpr.Input(0));
NDExpr.Call(Math.Pow, NDExpr.Input(0), NDExpr.Input(1));
NDExpr.Call(MathF.Tanh, NDExpr.Input(0));
// MethodInfo overload: useful when reflecting.
var mi = typeof(Math).GetMethod("BitIncrement", new[] { typeof(double) });
NDExpr.Call(mi, NDExpr.Input(0));
Emit: one call <methodinfo> opcode after the arguments are pushed. The JIT may inline the target when it's small and visible. No DelegateSlots entry, no runtime lookup. This is the fast path and is what you get automatically whenever the delegate has no captured state.
Path B — bound instance methods (one indirection).
class Activations
{
public double Temperature { get; set; }
public double Softmax(double x) => Math.Exp(x / Temperature);
}
var inst = new Activations { Temperature = 1.5 };
var mi = typeof(Activations).GetMethod("Softmax");
NDExpr.Call(mi, inst, NDExpr.Input(0));
Emit: the kernel first loads the target object from a process-wide DelegateSlots registry by integer ID, casts it to the method's declaring type, pushes the arguments, then callvirt <methodinfo>. Cost is one dictionary lookup (~5 ns) plus a virtual call. The target object's state is live — mutations are visible to subsequent kernel invocations.
Path C — captured delegates (one indirection).
// Works uniformly for lambdas with captures, instance-method-bound delegates,
// or any pre-constructed Delegate instance.
Func<double, double> swish = x => x / (1.0 + Math.Exp(-x));
NDExpr.Call(swish, NDExpr.Input(0));
// Pre-constructed delegate with explicit type (no method-group cast needed here).
Delegate d = swish;
NDExpr.Call(d, NDExpr.Input(0));
Emit: the kernel loads the delegate from DelegateSlots, casts it to its concrete runtime type (e.g. Func<double, double>), pushes arguments, then callvirt Invoke. Same ~5-10 ns overhead as Path B, plus the Delegate.Invoke dispatch stub (single virtual call).
Auto-conversion at the call boundary.
The node respects the DSL's single-output-dtype invariant:
ctx.OutputType param dtype return dtype ctx.OutputType
┌──────────────────┐ ┌─────────────┐ ┌───────────────┐ ┌──────────────────┐
│ args evaluated │─▶│ convert via │──▶│ method runs │──▶│ convert via │
│ in outputType │ │ EmitConvertTo│ │ │ │ EmitConvertTo │
└──────────────────┘ └─────────────┘ └───────────────┘ └──────────────────┘
So NDExpr.Call(Math.Sqrt, Input(0)) with an Int32 input and a Double output works end-to-end: the int gets loaded, converted to double at InputNode, arrives at the call as double (no further conversion needed for a Double param), Math.Sqrt runs, the double return flows out to the Double output slot. Flip the output dtype to Single and you'd get an extra Conv_R4 after the call.
Overload disambiguation.
Non-overloaded static methods bind to the typed Func<...> overload via method-group conversion — no cast needed:
NDExpr.Call(Math.Sqrt, x); // ✓ Func<double,double>
NDExpr.Call(Math.Cbrt, x); // ✓ same
NDExpr.Call(MathF.Tanh, x); // ✓ Func<float,float>
NDExpr.Call(Math.Pow, x, y); // ✓ Func<double,double,double>
Methods with multiple overloads (same name, different signatures) need a cast to disambiguate which one you want:
// ERROR: 'Math.Abs' has 9 overloads.
// NDExpr.Call(Math.Abs, x);
// ^^^^^^^^
// CS0121: The call is ambiguous between ...
// Cast to the concrete Func<...> you want:
NDExpr.Call((Func<double, double>)Math.Abs, x); // ✓ picks Math.Abs(double)
NDExpr.Call((Func<float, float>)MathF.Abs, x); // ✓ picks MathF.Abs(float)
NDExpr.Call((Func<long, long>)Math.Abs, x); // ✓ picks Math.Abs(long)
Alternatively, use the MethodInfo overload to pick by signature explicitly:
var mi = typeof(Math).GetMethod(nameof(Math.Abs), new[] { typeof(double) });
NDExpr.Call(mi, x); // unambiguous — the MethodInfo is already picked
Thread safety.
DelegateSlots registration uses Interlocked.Increment for ID generation and ConcurrentDictionary for storage, so concurrent Call construction from multiple threads is safe. Kernel compilation itself happens under the ConcurrentDictionary.GetOrAdd atomicity for the inner-loop cache — one compilation per key, even under contention. Once compiled, kernels are re-entrant (they only read the delegate/target from their immutable slot).
Performance envelope.
Per-element cost of the three paths, measured against a built-in DSL op on a post-warmup 1M-element double array:
| Path | Relative to built-in Sqrt | Notes |
|---|---|---|
| Static method (Path A) | ~1.5× slower | One managed call per element; JIT may inline small targets |
| Bound instance (Path B) | ~2-3× slower | Dict lookup + castclass + virtual call |
| Captured delegate (Path C) | ~2-4× slower | Same lookup + castclass + Delegate.Invoke stub |
These ratios assume the user's method does comparable arithmetic to Math.Sqrt. If your target does substantially more work (e.g. three Math.Exp calls), the ratio collapses toward 1 — the call overhead becomes negligible compared to the math.
Type discipline
Every intermediate value flows through the output dtype: Input(i) loads the i-th operand's dtype and auto-converts (via EmitConvertTo) to the output dtype; constants are emitted directly in the output dtype. This single-type intermediate invariant keeps the DSL simple — you don't need to reason about mixed-type arithmetic inside the tree.
Concrete example — integer to float promotion.
// Input is int32, output is float64. The DSL handles the promotion automatically.
var a = np.array(new int[] { 1, 4, 9, 16, 25 });
var r = np.empty(new Shape(5), np.float64);
using var iter = NDIterRef.MultiNew(2, new[] { a, r }, ...);
iter.ExecuteExpression(NDExpr.Sqrt(NDExpr.Input(0)),
inputTypes: new[] { NPTypeCode.Int32 }, outputType: NPTypeCode.Double);
// r = [1.0, 2.0, 3.0, 4.0, 5.0]
What the emitted IL does per element: load int32, Conv_R8 (promote to double), call Math.Sqrt(double), store double. The conversion is emitted at the Input node, not at the Sqrt node — all subsequent operations see the output-dtype value.
SIMD gate. The vector path is enabled only when every input dtype equals the output dtype (so a single Vector<T> instantiation covers the whole tree) and every node in the tree has a SIMD emit. If any node lacks a SIMD path, the whole compilation falls back to scalar — correctness preserved, but no 4× unroll. For mixed-dtype work you're in the scalar-strided fallback regardless.
SIMD coverage rules
A node's SupportsSimd determines whether Tier 3C emits the vector body:
- Yes:
Input,Const, the four arithmetic binary ops (+ - * /), the three bitwise binary ops (& | ^), and the unary opsNegate,Abs,Sqrt,Floor,Ceil,Square,Reciprocal,Deg2Rad,Rad2Deg,BitwiseNot. - No:
Mod,Power,FloorDivide,ATan2,Min/Max/Clamp/Where, all comparisons,Round,Truncate(no net8 SIMD method), all trig (exceptDeg2Rad/Rad2Deg), all log/exp,Sign,Cbrt,LogicalNot, predicates (IsNaN/IsFinite/IsInf),Call(user methods are always scalar — there is no vectorization path for arbitrary managed calls).
Predicate / LogicalNot result handling. Predicates (IsNaN/IsFinite/IsInf) and LogicalNot emit an I4 0/1 on the stack, not a value of the output dtype. UnaryNode detects these ops and inserts a trailing EmitConvertTo(Int32, outType) so the factory's final Stind matches. LogicalNot in particular routes through EmitComparisonOperation(Equal, outType) with an output-dtype zero literal: the Ldc_I4_0 + Ceq sequence is correct only for I4-width values (bool/byte/int16/int32), so the typed-zero comparison is what makes the result correct for Int64, Single, Double, and Decimal too.
A tree's SupportsSimd is true only if every node in it does. One unsupported node demotes the whole tree to scalar-only — which is usually still autovectorized by the JIT after tier-1 promotion, just without the 4× unroll.
Caching and auto-keys
Pass cacheKey to share the compiled delegate across iterators; omit it and the compiler auto-derives one from the tree's structural signature plus input/output dtypes. Actual examples (verified against NDExpr.AppendSignature):
NDExpr:Add(Multiply(In[0],Const[2]),Const[3]):in=Double:out=Double
NDExpr:Sqrt(Add(Square(In[0]),Square(In[1]))):in=Single,Single:out=Single
NDExpr:Where(CmpGreater(In[0],Const[0]),In[0],Multiply(Const[0.1],In[0])):in=Double:out=Double
NDExpr:Min(In[0],In[1]):in=Int32,Int32:out=Int32
NDExpr:IsNan(In[0]):in=Double:out=Double
NDExpr:LogicalNot(In[0]):in=Double:out=Double
NDExpr:BitwiseNot(In[0]):in=Int32:out=Int32
NDExpr:Mod(In[0],Const[3]):in=Double:out=Double
NDExpr:Sqrt(In[0]):in=Int32:out=Double ← int input, double output
NDExpr:Call[System.Math.Sqrt#100663308@<guid>](In[0]):in=Double:out=Double
NDExpr:Call[MyApp.Activations.Swish#167772171@<guid>,target#7](In[0]):in=Double:out=Double
Enum names appear verbatim (e.g. Multiply, not Mul; IsNan, not IsNaN — the enum is spelled IsNan).
Two trees with identical structure and types get the same auto-derived key and share a cached kernel. Each node class contributes a distinct signature prefix:
| Node class | Signature fragment |
|---|---|
InputNode |
In[i] |
ConstNode |
Const[value] (integer form if constructed from int/long; decimal form for float/double) |
BinaryNode |
<BinaryOp>(L,R) (e.g. Add(...), Mod(...), ATan2(...)) |
UnaryNode |
<UnaryOp>(C) (e.g. Sqrt(...), IsNan(...), BitwiseNot(...)) |
ComparisonNode |
Cmp<Op>(L,R) (e.g. CmpEqual(...), CmpGreater(...)) |
MinMaxNode |
Min(L,R) or Max(L,R) |
WhereNode |
Where(C,A,B) |
CallNode |
Call[<declaringType>.<name>#<metadataToken>@<moduleGuid>](args) — for instance methods, additionally ,target#<slotId> |
Constant value sensitivity. Two trees that differ only in a constant value (e.g.
x + 1vsx + 2) generate distinct keys — the constant is part of the signature, because it's baked into the emitted IL. If you need many kernels parameterized by a scalar, consider passing the scalar as a second input operand (as a 0-dNDArrayor a broadcast view) rather than a compile-time constant.Integer/float const collision.
NDExpr.Const(1)andNDExpr.Const(1.0)both serialize toConst[1]when thedoublevalue is whole. With the same output dtype they produce identical IL, so sharing a cache entry is correct. If you need to distinguish — say, to force a specific integer vs float constant interpretation — construct both trees separately and supply an explicitcacheKey.
Memory model and lifetime
Three things outlive a single Tier 3C call. Knowing what they are and how long they live explains the steady-state memory profile of a Tier 3C-heavy program.
1. Compiled kernels (_innerLoopCache).
Every unique (structural signature, inputTypes, outputType) triple produces a DynamicMethod that's JIT-compiled once and cached in a process-wide ConcurrentDictionary<string, NDInnerLoopFunc> keyed by the cache-key string. The cache is append-only within the process lifetime. Cache keys are strings, so GC collects the old tree nodes once compilation completes, but the compiled delegate itself holds its DynamicMethod handle indefinitely.
Typical memory profile:
- Each compiled kernel is ~2-5 KB of native code + its metadata in the runtime's dynamic-method table.
- Typical application: a few dozen unique expressions → ~100-200 KB of steady-state cache.
- Worst case: a hot loop constructing new-per-call trees → linear growth. Reuse expression objects or pass explicit cache keys.
To inspect or reset during tests:
GeneratedDelegates.InnerLoopCount; // count of compiled kernels (public)
GeneratedDelegates.ClearInnerLoop(); // internal — wipe for fresh-start testing
Both are internal, so scripts need the AssemblyName=NumSharp.DotNetRunScript override.
2. Registered delegates and bound targets (DelegateSlots).
Paths B and C of Call stash a managed reference in a static ConcurrentDictionary<int, Delegate> or ConcurrentDictionary<int, object> so the emitted IL can look it up at runtime. The reference is strong — entries live for the process lifetime. This is necessary: if the reference were weak, the GC could collect the delegate while a compiled kernel still holds its slot ID, and the next lookup would throw.
The cost is small per registration (~16-32 bytes for the dictionary entry plus whatever the delegate captures), but unbounded across registrations. Registering one delegate per kernel is fine; registering one delegate per iteration of a loop is a leak.
| Pattern | Registrations | Memory impact |
|---|---|---|
| Static method (Path A) | zero | none |
| Cached delegate reused every iter | one | negligible |
| Per-call lambda | one per call | linear in call count |
Test hook:
DelegateSlots.RegisteredCount; // strong-ref count across both dicts
DelegateSlots.Clear(); // wipe for testing (invalidates kernels that reference it!)
Calling
DelegateSlots.Clear()while a kernel that references a slot is still compiled makes the next call throwKeyNotFoundExceptionfrom inside the generated IL. Use it only in test setup/teardown, paired with clearing the inner-loop cache.
3. NDArrays referenced by the iterator.
Orthogonal to Tier 3C, but worth mentioning in the same section for completeness: NDIterRef holds a managed NDArray[] field so the operands' backing memory isn't collected mid-iteration. The field is released when you Dispose() the ref — the using var iter = ... pattern handles this automatically. Forgetting to dispose keeps the NDArrays alive for however long the iterator lives.
Registration-once pattern.
For Call-based activations or user kernels used in hot loops, the idiomatic pattern is:
static class MyActivations
{
// One delegate instance, registered once when the static class is first touched.
public static readonly Func<double, double> Swish =
x => x / (1.0 + Math.Exp(-x));
public static readonly Func<double, double> GELU =
x => 0.5 * x * (1.0 + Math.Tanh(
Math.Sqrt(2.0 / Math.PI) * (x + 0.044715 * x * x * x)));
}
// Usage — reuses the same slot + cached kernel every time:
var swished = NDExpr.Call(MyActivations.Swish, NDExpr.Input(0));
var gelud = NDExpr.Call(MyActivations.GELU, NDExpr.Input(0));
Validation and errors
The DSL fails fast at tree-construction time for structural errors and at compile time for type-mismatch or arity errors:
| Error condition | Where | Exception |
|---|---|---|
NDExpr.Input(-1) |
Factory | ArgumentOutOfRangeException |
NDExpr.Sqrt(null) |
Node constructor | ArgumentNullException |
NDExpr.Add(null, x) / Add(x, null) |
Node constructor | ArgumentNullException |
ExecuteExpression(expr, null, outType) |
Bridge entry | ArgumentNullException |
ExecuteExpression(expr, inputTypes, outType) with too-few inputs vs operand count |
Bridge entry | ArgumentException |
Input(5) when tree compiled with 2 inputs |
Compile-time IL emission | InvalidOperationException — message: "Input(5) out of range; compile provided 2 inputs." |
| Tree calls a vector-only path on a non-SIMD type (shouldn't happen via public API) | Compile-time | NotSupportedException |
Runtime errors depend on the op and dtype:
Divide/Mod/FloorDividewith a zero integer divisor →DivideByZeroExceptionfrom the CLI. Float division by zero produces±Infinity/NaNper IEEE 754, no exception.Power(neg, fractional)→NaNviaMath.Pow, no exception.- Overflow during
Conv_*from a float that's outside the target integer range → silently wraps or saturates per the CLI's conv opcode semantics (matchesunchecked {}casts in C#). UseConv_Ovf_*if you need checked behavior — not exposed through the DSL.
Behavioral notes
Non-obvious but permanent behaviors of the DSL:
NaN propagation in
Min/Maxmatchesnp.minimum/np.maximum, notnp.fmin/np.fmax. If you need NaN-skipping min/max, compose withIsNaNandWhere:// fmin(a, b): return non-NaN if one is NaN, else min var fmin = NDExpr.Where(NDExpr.IsNaN(a), b, NDExpr.Where(NDExpr.IsNaN(b), a, NDExpr.Min(a, b)));Moddoesn't match C#%for negative operands. C# truncates toward zero (-10 % 3 == -1); NumPy (andNDExpr.Mod) floor toward negative infinity (-10 mod 3 == 2). This matches Python%.Integer division by zero throws.
Divide(int_arr, int_arr_with_zero)raisesDivideByZeroExceptionat runtime. Float division is silent (produces±Infinity/NaN).Constants widen to the output dtype.
NDExpr.Const(1_000_000_000) + NDExpr.Input(0)where the output isBytewill emitLdc_I4 1000000000followed byConv_U1— the billion wraps to a small byte. The DSL won't check that the constant fits; you get silent truncation.Bitwise ops require integer output dtype.
NDExpr.Input(0) & NDExpr.Input(1)withoutputType = Doubleis a malformed tree —EmitScalarOperation(BitwiseAnd, Double)doesn't emitAndfor floats. You'll get anInvalidOperationExceptionor unverifiable IL at compile time. Use an integer output dtype, or convert throughBitwiseNot/BitwiseAndin integer land and cast to float separately.LogicalNotisx == 0, notx != 0. It returns 1 when the input is zero and 0 otherwise. Same as Python'snotapplied to a numeric value. If you want "non-zero as 1", useNDExpr.NotEqual(x, NDExpr.Const(0)).Input dtype mismatch is silent. If your
inputTypes[]saysInt32but the actual NDArray operand isInt16, the kernel reads 4 bytes starting at the int16 pointer — garbage. The iterator's buffer/cast machinery only kicks in withBUFFERED | NPY_*_CASTING. For ad-hoc Tier 3C use, make sureinputTypes[i]matches the actual NDArray dtype, or run the iterator with casting flags.Comparisons in non-float arithmetic can be off-by-one. For integer-output trees,
NDExpr.Greater(x, Const(0.5))withxasInt32will compare two integers —Const(0.5)gets emitted asLdc_I4 0, becauseConstNode.EmitLoadTypedconverts the literal to the output dtype's CLI type.Greater(int_x, 0)is rarely the intended comparison. Use an explicitConst(1)with the correct integer threshold, or change the output dtype to a float.Whereduplicates both branches in IL. The true-branch IL and false-branch IL are emitted sequentially with abrskipping the false side when cond is true. Deeply-nestedWheres quadruple IL size (1 → 2 → 4 → 8 branches). For more than ~10 levels of nesting, consider flattening with a lookup table via Tier 3B.Calldelegates are held forever.CallNodestashes captured delegates and bound instance targets in a process-wideDelegateSlotsdictionary so the emitted IL can look them up. There is no eviction. If you callNDExpr.Call(x => x * scale, in0)inside a hot loop (creating a new closure each iteration), the dictionary grows without bound. Register delegates once at startup — astatic readonly Func<double, double>field or a DI singleton — and reuse them.Callmethod-group ambiguity.NDExpr.Call(Math.Abs, x)fails to compile becauseMath.Abshas nine overloads (double,float,int,long, etc.) and the compiler can't pick one. Cast to the specificFunc<...>you want:NDExpr.Call((Func<double, double>)Math.Abs, x). Single-overload methods likeMath.Sqrt,Math.Cbrt,Math.Logbind without cast.Callruns at scalar speed. A managed method call per element forfeits SIMD. For a sustained throughput-critical op, it's ~30-50% slower than the equivalent built-in DSL node because the call itself has overhead beyond just computing the result. UseCallfor math the DSL doesn't expose (user-defined activations,MathNet.Numericsroutines, lookup tables via a method), not for things likeSqrtwhereNDExpr.Sqrt(x)is the right answer.Callreturn type widening is lossy for NaN. If a delegate returnsintand the tree output isdouble,Math.Floor(NaN) = NaNgets cast toint(yielding0or some CPU-dependent value), which widens back to the float representation of that integer. NaN information is lost across integer-returning calls. Match return dtype to output dtype when NaN correctness matters.
Debugging compiled kernels
Tier 3C kernels are DynamicMethod delegates — you can't step into their IL with a debugger as-is. What you can do:
- Inspect the kernel cache.
GeneratedDelegates.InnerLoopCount(public) gives you a count.GeneratedDelegates.ClearInnerLoop()(internal; needs[InternalsVisibleTo]or adotnet_runscript withAssemblyName=NumSharp.DotNetRunScript) lets you force recompilation in a test. - Inspect the delegate slot registry (only relevant when
Callis in play).DelegateSlots.RegisteredCount(internal) returns the sum of registered delegates + registered instance targets. Growing unboundedly means a per-call lambda or target allocation somewhere — find it by comparing counts before and after your suspected hot path.DelegateSlots.Clear()wipes the registry; always pair withGeneratedDelegates.ClearInnerLoop()because cleared-but-cached kernels will throwKeyNotFoundExceptionon their next invocation. - Print the auto-derived cache key. Construct the tree, call
new StringBuilder().Also(e => node.AppendSignature(sb))(AppendSignatureis internal). The printed signature is exactly what goes into the cache key — useful for diagnosing "why aren't these two trees sharing a kernel?". ForCallnodes in particular, the signature includesMetadataTokenandModuleVersionId— if those differ across two calls of what you thought was the same method, the compiler loaded the method from different assemblies or modules. - Reduce to a minimal tree. If a compiled kernel misbehaves, isolate the failing subtree by compiling just that fragment against a tiny input (1-3 elements).
ExecuteExpressionon a 3-element array still exercises the scalar path; crashes become reproducible in a few lines. - Watch the output dtype.
ExecuteExpressionexpectsoutputTypeto match the output NDArray's dtype. If they disagree, the kernel reads/writes wrong byte counts. Double-check both. - Diagnose "method group ambiguous" errors. If you see
CS0121: The call is ambiguous between the following methodswhen writingNDExpr.Call(Math.X, ...), the method has multiple overloads (e.g.Math.Abshas 9). Cast to the specificFunc<...>you want, or use theMethodInfooverload with an explicit parameter-types array toGetMethod. - Diagnose "Method X returns void" errors — you passed a method with no return value to
Call. Tier 3C requires every node to contribute a value to the output dtype. - Diagnose "Target is X, method declares Y" errors — your instance
MethodInfocall received a target that isn't an instance of the method's declaring type. Confirm both the method and the target came from the same type, especially if you're reflecting across a plugin boundary. - Enable IL dumps by emitting into a persistent assembly instead of
DynamicMethod— not a supported build configuration, butILKernelGenerator.InnerLoop.csis a single partial file you can modify in a workspace-only diff if you need to dump bytes during development.
When to use Tier 3C
Reach for Tier 3C when you want Layer 3 ergonomics for fused or custom ops and you're not chasing the last 15% of throughput. The DSL covers arithmetic, bitwise, rounding, transcendentals (exp/log/trig/hyperbolic/inverse-trig), predicates (IsNaN/IsFinite/IsInf), comparisons, Min/Max/Clamp/Where, and common compositions (ReLU, Leaky ReLU, sigmoid, clamp, hypot, linear, FMA, piecewise functions) without writing IL. For absolute peak perf on a hot ufunc — or for ops outside the DSL's node catalog (e.g. intrinsics the runtime exposes but the DSL doesn't wrap) — drop to Tier 3B and hand-tune the vector body.
Decision tree: which tier do I need?
Is the op a standard NumPy ufunc already in ExecuteBinary/Unary/Reduction?
yes → Layer 3 (baked). Fastest, zero work. Done.
no ↓
Can I express it as a tree of DSL nodes (Add, Sqrt, Where, Exp, etc.)?
yes → Tier 3C. Fused, SIMD-or-scalar automatic, no IL.
no ↓
Is the missing piece a BCL method (Math.X, user activation, reflected plugin)?
yes → Tier 3C with Call. Scalar but fused. Done.
no ↓
Do I need V256/V512 intrinsics the DSL doesn't wrap (Fma, Shuffle, ...)?
yes → Tier 3B. Hand-write the vector body; factory wraps the shell.
no ↓
Is the loop shape non-rectangular (gather/scatter, cross-element deps)?
yes → Tier 3A. Emit the whole inner-loop IL yourself.
Caching is shared across all tiers. All three write into the same _innerLoopCache inside ILKernelGenerator.InnerLoop.cs. The first ExecuteRawIL("k") call JIT-compiles; every subsequent call with the same key returns the cached delegate immediately. GeneratedDelegates.InnerLoopCount (public) exposes the size for tests.
Path Detection
DetectExecutionPath() is the heart of Layer 3. It looks at the iterator after coalescing and negative-stride flipping, and picks:
if (CONTIGUOUS flag set) return SimdFull;
if (NDim == 0) return SimdFull;
if (op1 is scalar AND op0 is contiguous) return SimdScalarRight;
if (op0 is scalar AND op1 is contiguous) return SimdScalarLeft;
if (every operand's innermost stride ∈ {0, 1}) return SimdChunk;
otherwise return General;
"Scalar" here means every stride is 0 across every dimension — the operand is a 0-d array or a fully broadcasted view. "Contiguous" uses the standard backward stride check.
The resulting ExecutionPath is baked into the MixedTypeKernelKey:
var key = new MixedTypeKernelKey(LhsType, RhsType, ResultType, Op, Path);
Different paths get different IL. SimdFull emits a flat 4× unrolled SIMD loop. SimdScalarRight broadcasts the scalar into a vector once, then runs a SIMD loop against only the LHS. SimdChunk processes the inner dim as a chunk within an outer coordinate loop. General does full coordinate-based iteration in IL. All of that machinery already lives in ILKernelGenerator; Layer 3's job is just to pick the right key.
Worked Examples
Seventeen worked examples grouped by API tier.
Layers 1–3 (baked kernels):
- Three-operand binary over a 3-D contiguous array
- Array × scalar with broadcast detection
- Sliced view — non-contiguous input
- Fused hypot via Layer 1
- Early-exit Any over 1M elements
Tier 3B (templated scalar + vector bodies):
Tier 3C (expression DSL):
- ReLU via Tier 3C comparison-multiply
- Clamp with Min/Max
- Softmax-ish: exp then divide-by-sum
- Sigmoid via Where for numerical stability
- NaN-replacement using IsNaN + Where
- Leaky ReLU via piecewise Where
- Manual abs via comparison + Where
- Heaviside step function
- Polynomial evaluation via Horner's method
- Piecewise: absolute value of sine (abs(sin(x)))
- User-defined activation via NDExpr.Call
- Reflected MethodInfo with an instance method
1. Three-operand binary over a 3-D contiguous array
var a = np.arange(24, dtype: np.float32).reshape(2, 3, 4);
var b = (np.arange(24, dtype: np.float32).reshape(2, 3, 4) * 2f).astype(np.float32);
var c = np.zeros(new Shape(2, 3, 4), np.float32);
using var iter = NDIterRef.MultiNew(
nop: 3, op: new[] { a, b, c },
flags: NDIterGlobalFlags.None,
order: NPY_ORDER.NPY_KEEPORDER,
casting: NPY_CASTING.NPY_NO_CASTING,
opFlags: new[] { NDIterPerOpFlags.READONLY,
NDIterPerOpFlags.READONLY,
NDIterPerOpFlags.WRITEONLY });
iter.ExecuteBinary(BinaryOp.Add);
// NDim = 1 after coalesce, Path = SimdFull
// ILKernelGenerator emits a 4×-unrolled V256 add loop
// c[1,2,3] = 69
One call. 3-D → 1-D coalesce → one SIMD kernel runs over 24 elements. The generated IL is the same regardless of whether a and b started as 3-D, 4-D, or flat — as long as they're contiguous.
2. Array × scalar with broadcast detection
var vec = np.arange(8, dtype: np.float32);
var sc = np.full(new Shape(), 100f, NPTypeCode.Single); // 0-d scalar
var res = np.zeros(new Shape(8), np.float32);
using var iter = NDIterRef.MultiNew(3, new[] { vec, sc, res }, ...);
Console.WriteLine(iter.DetectExecutionPath()); // SimdScalarRight
iter.ExecuteBinary(BinaryOp.Multiply);
// res = vec * 100
The 0-d scalar comes through with all strides equal to 0, so DetectExecutionPath picks SimdScalarRight. The kernel loads the scalar once, splats it into a V256 register, and multiplies the whole LHS against it.
3. Sliced view — non-contiguous input
var big = np.arange(20, dtype: np.float32).reshape(4, 5);
var slice = big[":, 1:4"]; // 4×3 view, strides = [5, 1]
var dst = np.zeros(new Shape(4, 3), np.float32);
using var iter = NDIterRef.MultiNew(2, new[] { slice, dst }, ...);
iter.ExecuteUnary(UnaryOp.Sqrt);
// dst[3,2] = sqrt(big[3,3]) = sqrt(18) ≈ 4.243
The slice can't coalesce (stride 5 on outer axis, stride 1 on inner) so NDim stays at 2 and IsContiguous is false. Layer 3 picks the strided UnaryKernel, which computes offset = sum(coord[d] * stride[d]) at each element.
4. Fused hypot via Layer 1
using var iter = NDIterRef.MultiNew(3, new[] { a, b, result },
NDIterGlobalFlags.EXTERNAL_LOOP, ...);
iter.ForEach((ptrs, strides, count, _) => {
if (strides[0] == 4 && strides[1] == 4 && strides[2] == 4) {
float* pa = (float*)ptrs[0], pb = (float*)ptrs[1], pc = (float*)ptrs[2];
for (long i = 0; i < count; i++)
pc[i] = MathF.Sqrt(pa[i] * pa[i] + pb[i] * pb[i]); // JIT → V256
} else {
byte* pA = (byte*)ptrs[0], pB = (byte*)ptrs[1], pC = (byte*)ptrs[2];
long sA = strides[0], sB = strides[1], sC = strides[2];
for (long i = 0; i < count; i++) {
float av = *(float*)(pA + i * sA);
float bv = *(float*)(pB + i * sB);
*(float*)(pC + i * sC) = MathF.Sqrt(av * av + bv * bv);
}
}
});
Without Layer 1 this operation would be sqrt(a * a + b * b) — three Layer 3 calls and three temporary arrays. Fused into one kernel, it runs in a single pass with zero intermediates. The stride branch is the idiom that lets the JIT autovectorize the tight case while the outer shape keeps the kernel correct for strided inputs.
5. Early-exit Any over 1M elements
var data = np.zeros(new Shape(1_000_000), NPTypeCode.Int32);
data[500] = 1;
using var iter = NDIterRef.New(data, flags: NDIterGlobalFlags.EXTERNAL_LOOP);
bool found = iter.ExecuteReducing<AnyNonZero, bool>(default, false);
// found = true, after exactly one ForEach call (SIMD early exit inside kernel).
6. Fused hypot via Tier 3C expression
The same hypot operation written as an expression tree — no IL, no hand-written stride branch. The factory emits a 4×-unrolled V256 kernel on the contiguous path and a scalar-strided fallback on non-contiguous input.
using var iter = NDIterRef.MultiNew(3, new[] { a, b, result },
NDIterGlobalFlags.EXTERNAL_LOOP, ...);
var expr = NDExpr.Sqrt(NDExpr.Square(NDExpr.Input(0)) +
NDExpr.Square(NDExpr.Input(1)));
iter.ExecuteExpression(expr,
inputTypes: new[] { NPTypeCode.Single, NPTypeCode.Single },
outputType: NPTypeCode.Single);
// result[i] = sqrt(a[i]² + b[i]²), fused in one pass, SIMD-vectorized
Compare with example 4 — same output, same performance envelope, no IL emission visible in your code. The tree's structural signature "Sqrt(Add(Square(In[0]),Square(In[1])))" becomes the cache key, so every iterator that runs the same expression reuses the same compiled delegate.
7. Fused linear transform via Tier 3B with vector body
When you want the Tier 3C ergonomics but also want the vector body under your control (e.g. to insert a Vector256 intrinsic the DSL doesn't expose):
iter.ExecuteElementWiseBinary(
NPTypeCode.Single, NPTypeCode.Single, NPTypeCode.Single,
scalarBody: il =>
{
// Stack: [a, b] → [a*2 + b*3]
il.Emit(OpCodes.Ldc_R4, 2.0f); il.Emit(OpCodes.Mul); // a*2
var tmp = il.DeclareLocal(typeof(float));
il.Emit(OpCodes.Stloc, tmp); // stash a*2
il.Emit(OpCodes.Ldc_R4, 3.0f); il.Emit(OpCodes.Mul); // b*3
il.Emit(OpCodes.Ldloc, tmp); il.Emit(OpCodes.Add); // a*2 + b*3
},
vectorBody: il =>
{
// Stack: [va, vb]
il.Emit(OpCodes.Ldc_R4, 2.0f); ILKernelGenerator.EmitVectorCreate(il, NPTypeCode.Single);
ILKernelGenerator.EmitVectorOperation(il, BinaryOp.Multiply, NPTypeCode.Single); // va*2
var tmp = il.DeclareLocal(ILKernelGenerator.GetVectorType(typeof(float)));
il.Emit(OpCodes.Stloc, tmp);
il.Emit(OpCodes.Ldc_R4, 3.0f); ILKernelGenerator.EmitVectorCreate(il, NPTypeCode.Single);
ILKernelGenerator.EmitVectorOperation(il, BinaryOp.Multiply, NPTypeCode.Single); // vb*3
il.Emit(OpCodes.Ldloc, tmp);
ILKernelGenerator.EmitVectorOperation(il, BinaryOp.Add, NPTypeCode.Single);
},
cacheKey: "linear_2a_3b_f32");
Single pass, no temporaries, SIMD-unrolled. Conceptually the same as 2*a + 3*b written via Tier 3C, but lets you drop in Vector256.Fma or similar intrinsics if you ever need them.
8. ReLU via Tier 3C comparison-multiply
ReLU in one fused kernel, leveraging Tier 3C's "comparison returns 0/1 at output dtype" semantics:
using var iter = NDIterRef.MultiNew(2, new[] { input, output },
NDIterGlobalFlags.EXTERNAL_LOOP, ...);
var relu = NDExpr.Greater(NDExpr.Input(0), NDExpr.Const(0.0f)) * NDExpr.Input(0);
iter.ExecuteExpression(relu,
new[] { NPTypeCode.Single }, NPTypeCode.Single);
// output[i] = max(input[i], 0) for every i
No branch, no intermediate array. The comparison node emits an I4 0/1, gets converted to float, and the multiply folds it into the final value. Scalar path only (comparisons don't SIMD), but the JIT autovectorizes the resulting tight loop post-tier-1.
9. Clamp with Min/Max
var clamped = NDExpr.Clamp(NDExpr.Input(0), NDExpr.Const(-1.0), NDExpr.Const(1.0));
iter.ExecuteExpression(clamped,
new[] { NPTypeCode.Double }, NPTypeCode.Double);
// output[i] = min(max(input[i], -1), 1)
Clamp is just sugar for Min(Max(x, lo), hi) — both map to branchy scalar selects that propagate NaN (matching np.minimum / np.maximum rather than np.fmin / np.fmax).
10. Softmax-ish: exp then divide-by-sum
Tier 3C is element-wise; reductions (like summing all elements) aren't expressible directly. But the element-wise half of softmax is:
// out = exp(x - max_x) / sum_exp — where max_x and sum_exp are precomputed scalars.
var shifted = NDExpr.Subtract(NDExpr.Input(0), NDExpr.Const(maxX));
var numerator = NDExpr.Exp(shifted);
var result = numerator / NDExpr.Const(sumExp);
iter.ExecuteExpression(result,
new[] { NPTypeCode.Double }, NPTypeCode.Double);
Scalar path only (Exp isn't in the vector emit set), but the tree fuses three operations into one kernel — versus three Layer 3 calls with two temporary arrays.
11. Sigmoid via Where for numerical stability
The naive 1 / (1 + exp(-x)) overflows for very negative x (exp of a large positive number). A numerically stable form uses two branches:
// { 1 / (1 + exp(-x)) if x >= 0
// sigmoid = { exp(x) / (1 + exp(x)) if x < 0
var x = NDExpr.Input(0);
var pos = NDExpr.Const(1.0) / (NDExpr.Const(1.0) + NDExpr.Exp(-x));
var neg = NDExpr.Exp(x) / (NDExpr.Const(1.0) + NDExpr.Exp(x));
var stable = NDExpr.Where(NDExpr.GreaterEqual(x, NDExpr.Const(0.0)), pos, neg);
iter.ExecuteExpression(stable,
new[] { NPTypeCode.Double }, NPTypeCode.Double);
Every branch computes three Exp calls in the worst case, but only the taken branch's values are materialized — Where emits actual brfalse + jump IL, not a branchless blend. For large arrays, branch prediction handles a sign-bit pattern well. If your input is already known to be mostly positive or mostly negative, this is noticeably cheaper than the naive 1/(1+exp(-x)) kernel.
12. NaN-replacement using IsNaN + Where
// replace NaN with 0
var x = NDExpr.Input(0);
var clean = NDExpr.Where(NDExpr.IsNaN(x), NDExpr.Const(0.0), x);
iter.ExecuteExpression(clean,
new[] { NPTypeCode.Double }, NPTypeCode.Double);
IsNaN(x) emits a double.IsNaN call that leaves an I4 0/1 on the stack, and UnaryNode inserts an implicit EmitConvertTo(Int32, Double) so Where's condition-normalizer gets the right dtype. The whole tree is scalar-only but fuses NaN-detection and replacement into a single pass.
13. Leaky ReLU via piecewise Where
// leaky_relu(x, alpha=0.1) = x if x > 0 else alpha * x
var x = NDExpr.Input(0);
var leaky = NDExpr.Where(
NDExpr.Greater(x, NDExpr.Const(0.0)),
x,
NDExpr.Const(0.1) * x);
iter.ExecuteExpression(leaky,
new[] { NPTypeCode.Double }, NPTypeCode.Double);
Contrast with the "branchless" ReLU ((x > 0) * x): that works for plain ReLU because the false branch is zero, but doesn't handle Leaky ReLU's non-zero negative side. Where is the general escape hatch.
14. Manual abs via comparison + Where
A worked example of combining comparisons with Where for pedagogical purposes (the DSL's Abs is faster — it has a SIMD path):
var x = NDExpr.Input(0);
var manualAbs = NDExpr.Where(
NDExpr.Less(x, NDExpr.Const(0.0)),
-x, // operator overload for Negate
x);
iter.ExecuteExpression(manualAbs,
new[] { NPTypeCode.Double }, NPTypeCode.Double);
This is ~10% slower than NDExpr.Abs(x) because it runs the scalar-only Where instead of the SIMD-vectorized Abs. Use the built-in where possible; Where is the generalization when no built-in fits.
15. Heaviside step function
// heaviside(x, h0) = 0 if x < 0, h0 if x == 0, 1 if x > 0
// NumPy's np.heaviside(x, 0.5) is the default "midpoint" convention.
var x = NDExpr.Input(0);
var step = NDExpr.Where(
NDExpr.Less(x, NDExpr.Const(0.0)),
NDExpr.Const(0.0),
NDExpr.Where(
NDExpr.Greater(x, NDExpr.Const(0.0)),
NDExpr.Const(1.0),
NDExpr.Const(0.5))); // h0 value at x == 0
iter.ExecuteExpression(step,
new[] { NPTypeCode.Double }, NPTypeCode.Double);
Three-way nested Where flattens to linear IL — two brfalse branches at runtime. The auto-derived cache key becomes Where(CmpLess(In[0],Const[0]),Const[0],Where(CmpGreater(In[0],Const[0]),Const[1],Const[0.5])). Reused automatically across iterators.
16. Polynomial evaluation via Horner's method
Evaluate p(x) = 1·x⁴ + 2·x³ + 3·x² + 4·x + 5 with optimal multiplications:
// ((((1·x + 2)·x + 3)·x + 4)·x + 5
var x = NDExpr.Input(0);
var poly = (((NDExpr.Const(1.0) * x + NDExpr.Const(2.0)) * x
+ NDExpr.Const(3.0)) * x
+ NDExpr.Const(4.0)) * x
+ NDExpr.Const(5.0);
iter.ExecuteExpression(poly,
new[] { NPTypeCode.Double }, NPTypeCode.Double);
Four Multiplys, four Adds — all SIMD-capable. Whole tree emits the 4×-unrolled V256 path. For a degree-N polynomial this stays in registers end-to-end, with no intermediate array allocations. Compare with the naïve 1*x*x*x*x + 2*x*x*x + 3*x*x + 4*x + 5 — ten multiplications, same IL size after constant folding by the JIT, but less readable.
17. Piecewise: absolute value of sine (abs(sin(x)))
Combine the two unary SIMD-capable ops for the pattern |sin x|:
var expr = NDExpr.Abs(NDExpr.Sin(NDExpr.Input(0)));
iter.ExecuteExpression(expr,
new[] { NPTypeCode.Double }, NPTypeCode.Double);
Sin is scalar-only, so the whole tree runs scalar (no 4× unroll). But both ops fuse into one pass — a single Math.Sin call + Math.Abs per element. The alternative — two Layer 3 calls on three arrays — would allocate a sin(x) temporary.
18. User-defined activation via NDExpr.Call
Say you want Swish (x * sigmoid(x), used in EfficientNet and family) but Tier 3C doesn't have a Sigmoid node. Drop to Call:
// Registered once at startup — static readonly field, not a per-call lambda.
static readonly Func<double, double> SwishActivation =
x => x / (1.0 + Math.Exp(-x));
// Tree: out = Swish(x) + bias (bias is a broadcast-scalar Input, not a Const)
var expr = NDExpr.Call(SwishActivation, NDExpr.Input(0)) + NDExpr.Input(1);
iter.ExecuteExpression(expr,
new[] { NPTypeCode.Double, NPTypeCode.Double }, NPTypeCode.Double);
The SwishActivation delegate is registered exactly once into DelegateSlots; every subsequent iter reuses the same slot ID and the same compiled kernel (auto-derived cache key is stable because it's keyed by MethodInfo.MetadataToken, not delegate identity). Runtime overhead is ~5 ns per element for the slot lookup + one Delegate.Invoke call per element — still single-pass, still zero intermediates.
For maximum speed, if your activation is hot enough to matter, compose it out of DSL primitives:
var x = NDExpr.Input(0);
var swish = x / (NDExpr.Const(1.0) + NDExpr.Exp(-x)); // same op, no Call overhead
19. Reflected MethodInfo with an instance method
Sometimes you're calling a method you discovered via reflection (e.g. an op registered through a plugin system). Use the MethodInfo + target overload:
var provider = new PluginActivations { Temperature = 1.5 };
var method = provider.GetType().GetMethod("ApplyTempered")!;
// ApplyTempered(double x) => Math.Pow(x, 1.0 / Temperature);
var expr = NDExpr.Call(method, provider, NDExpr.Input(0));
iter.ExecuteExpression(expr,
new[] { NPTypeCode.Double }, NPTypeCode.Double);
The provider object's state (Temperature) is captured into the compiled kernel via a DelegateSlots slot ID. Mutating provider.Temperature between calls is visible to subsequent invocations — the slot holds the reference, not a snapshot.
Performance
Benchmarking 1M sqrt on a contiguous float32 array after 300 warmup iterations, Ryzen-class CPU:
| Approach | Time | ns/elem | Notes |
|---|---|---|---|
ForEach with byte-ptr scalar |
2.82 ms | 2.82 | JIT autovectorizes V256 sqrt, no unroll |
ExecuteGeneric<Scalar> byte-ptr |
2.54 ms | 2.54 | Same, no delegate indirection |
ExecuteGeneric<Scalar> typed-ptr branch |
2.79 ms | 2.79 | if (stride == 4) float* branch |
ExecuteGeneric<V256+4x> hand-SIMD |
0.86 ms | 0.86 | User-written Vector256 + 4× unroll |
ExecuteUnary(Sqrt) IL kernel |
0.75 ms | 0.75 | ILKernelGenerator's 4×-unrolled V256 |
Layer 3 is ~3.7× faster than Layer 1/2 scalar code — the gap is entirely explained by loop unrolling, since the JIT does autovectorize a typed-pointer loop into V256 but doesn't issue the four independent vectors per iteration that ILKernelGenerator emits. A user who writes Vector256 + 4× unroll by hand closes the gap to 15% (0.86 vs 0.75 ms).
Layer 1 and Layer 2 give you control and fusion. For any standard elementwise ufunc, Layer 3 is the right default. Drop to Layer 1/2 when fusing several ops (one pass, zero temporaries), when the op isn't in ILKernelGenerator, or when your kernel has a structure the generator can't express.
Custom ops (Tier 3B / Tier 3C) hit the Layer 3 envelope. Because the factory wraps user bodies in the same 4×-unrolled SIMD + remainder + scalar-tail shell, a Tier 3B or Tier 3C kernel for sqrt lands within rounding distance of ExecuteUnary(Sqrt) — the only overhead is the runtime contig check (a few stride comparisons at kernel entry). Fused ops like sqrt(a² + b²) via Tier 3C are typically faster than composing three Layer 3 calls, because there are no intermediate arrays and the whole computation stays in V256 registers between operations.
Custom op overhead breakdown. Tier 3A and Tier 3B kernels share the same NDInnerLoopFunc delegate shape as the baked ufuncs; call overhead is identical. Tier 3C adds:
| Overhead source | When | Cost |
|---|---|---|
| Compile (first call per key) | First ExecuteExpression with a given cache key |
1-10 ms one-time (IL emission + JIT) |
| Auto-key derivation | When cacheKey: null |
~O(tree size) StringBuilder walk — typically < 1 μs |
| Runtime contig check | Every inner-loop entry | 2-4 stride comparisons (~ns) |
| Scalar-strided fallback | When any operand has non-contig inner stride | Per-element pointer arithmetic; JIT autovectorizes post-tier-1 |
Call dispatch (Path A) |
Every element — static method | One call <methodinfo>; JIT may inline |
Call dispatch (Path B/C) |
Every element — instance or delegate | ldc.i4 + DelegateSlots.Lookup + castclass + callvirt (~5-10 ns) |
When fusion pays off. Fusing sqrt(a² + b²) into one Tier 3C kernel avoids materializing the a² and a² + b² intermediates. For 1M float32 elements, that's 8 MB of memory traffic saved per temporary — on a typical 30-GB/s RAM bandwidth, that's ~300 μs per avoided temporary. Fusing 3 ops into one Tier 3C kernel can beat 3 baked Layer 3 calls by 1-2× when memory-bound.
When Call pays off. If the user-supplied method does nontrivial work (e.g. three Math.Exp calls for a numerically-stable sigmoid), the dispatch overhead is a few-percent tax on something that was never going to SIMD anyway. If the method is trivial (x => x * 2), composing out of DSL primitives (NDExpr.Input(0) * NDExpr.Const(2.0)) keeps the SIMD path and runs 3-5× faster. Pick Call when the method is the cheapest thing to write and the kernel isn't a hot path; pick DSL composition when the kernel is profiled and matters.
JIT warmup and measurement
.NET uses tiered compilation: methods first compile to unoptimized tier-0 code, then get promoted to tier-1 after ~100+ calls. Until tier-1, autovectorization doesn't happen — a scalar kernel that runs at 2.5 ms/iter in steady state measures at 70+ ms/iter when warmed up only 10 times.
Under-warmed measurement shows up as:
- Layer 2 scalar shows 50-80 ms instead of 2-5 ms
ExecuteGenericlooks slower thanForEach(it isn't, post-warmup)- Reusing a single iterator looks 50× faster than constructing fresh ones (the reuse path warmed up faster because it kept hitting the same call site)
Benchmark with ≥200 warmup iterations per variant, not just a few. Production code doesn't see this effect because long-running loops are always past tier-1.
Implementation Notes
The bridge is tuned for the JIT in two ways:
Fast-path split.
ExecuteGenericdispatches toExecuteGenericSingle(1 call, inlineable) orExecuteGenericMulti(do/while driver). Small single-call bodies are what the autovectorizer needs to do its job — a do/while with a delegate inside prevents tier-1 SIMD promotion.AggressiveInlining + AggressiveOptimization. Both attributes sit on the fast path so the JIT doesn't punt on inlining due to method size and immediately promotes to tier-1 once discovered hot.
Without these, ExecuteGeneric gets stuck at tier-0 in micro-benchmarks and looks 30× slower than it actually is.
When Does Each Layer Pay Off?
| Layer | Good for | Drawback |
|---|---|---|
Layer 1 (ForEach) |
Exploration, one-off fused kernels, non-standard ops | Delegate allocation per call; no loop unrolling |
Layer 2 (ExecuteGeneric) |
Same as Layer 1 in a hot path | No delegate cost, otherwise same — no loop unrolling |
Layer 3 (Execute*) |
Standard ufuncs already in ILKernelGenerator |
No fusion; one kernel per call |
BufferedReduce |
Axis reductions with casting | Double-loop only worth it with BUFFER + REDUCE |
To reach Layer 3 parity in Layer 2, keep a typed-pointer fast branch and add the 4× unroll yourself. The typed-pointer contiguous branch helps the JIT tier up faster and gives the autovectorizer a trivial pattern to match:
public void Execute(void** p, long* s, long n) {
if (s[0] == sizeof(float) && s[1] == sizeof(float)) {
float* src = (float*)p[0]; float* dst = (float*)p[1];
for (long i = 0; i < n; i++) dst[i] = MathF.Sqrt(src[i]); // JIT → V256
} else {
byte* p0 = (byte*)p[0]; byte* p1 = (byte*)p[1];
long s0 = s[0], s1 = s[1];
for (long i = 0; i < n; i++)
*(float*)(p1 + i * s1) = MathF.Sqrt(*(float*)(p0 + i * s0));
}
}
For maximum throughput, write the 4×-unrolled V256 version in the fast branch — you'll land within 15% of the IL kernel.
Allocations
Layer 3 allocates exactly once per call: the stackalloc stride arrays (NDim longs each). No heap allocation. Layer 2 inlines the entire kernel body into the JIT's codegen of ExecuteGeneric — no allocation at all, not even a delegate. Layer 1 allocates a single delegate per call (closure if it captures anything).
Custom-op tiers:
| Tier | Per-call allocation | One-time allocation |
|---|---|---|
Tier 3A (ExecuteRawIL) |
stackalloc strides + the user's Action<ILGenerator> closure on first compile |
compiled DynamicMethod cached by key; stays live for process lifetime (~2-5 KB native + runtime metadata) |
Tier 3B (ExecuteElementWise) |
stackalloc strides + (on first compile) two Action<ILGenerator> closures |
compiled kernel cached by key |
Tier 3C (ExecuteExpression) |
stackalloc strides + (on first compile) an NDExpr tree allocated by the caller + StringBuilder for the auto-key | compiled kernel cached by key |
Tier 3C with Call |
same as Tier 3C, plus one DelegateSlots entry per unique captured delegate / bound target |
registered references live for process lifetime; see Memory model and lifetime |
The one case where allocations grow without bound is the anti-pattern of constructing a new Call delegate per iteration — each new delegate reference gets a new slot ID and a new cache entry. Register delegates once at startup to avoid this.
Summary
NDIter is how NumSharp turns "iterate these three arrays of possibly-different shapes, types, and strides" into a deterministic schedule of pointer advances. NDIter.Execution.cs is how that schedule becomes a SIMD kernel call.
The core idea. NumPy's C++ templates compile for (i = 0; i < n; i++) c[i] = a[i] + b[i] ahead of time, specialized per type. NumSharp cannot. Instead it emits that same loop as IL via DynamicMethod the first time you ask for it, then caches the JIT-compiled delegate forever. NDIter handles the layout problem (what offsets, in what order), ILKernelGenerator handles the type problem (what opcodes, with what SIMD intrinsics), and NDIter.Execution.cs hands the one to the other.
Three layers. ExecuteBinary / Unary / Reduction / ... for standard ufuncs (this is what you want 90% of the time — it's ~3.7× faster than a JIT-autovectorized scalar loop and ~1.15× faster than hand-written Vector256 + 4× unroll). ExecuteGeneric<TKernel> for custom kernels that need zero dispatch overhead. ForEach with a NDInnerLoopFunc delegate when you're exploring, fusing, or writing something exotic.
Custom ops extend Layer 3. When a baked ufunc doesn't match your problem, three tiers let you reach the same SIMD-unrolled performance envelope without leaving the bridge: ExecuteRawIL (you emit the whole body), ExecuteElementWise (you supply per-element scalar + vector IL; factory wraps the unroll shell), ExecuteExpression (compose with NDExpr — no IL required). Each tier is cached, reuses ILKernelGenerator's emit primitives, and runs through the same ForEach driver as baked ops.
Coalesce first. A 3-D contiguous array should run as one flat SIMD loop, not a triple-nested loop. The iterator does this for you — as long as you don't set flags that disable it (MULTI_INDEX, C_INDEX, F_INDEX).
Buffer when casting or when non-contiguous + SIMD-critical. The iterator copies strided input into aligned contiguous buffers, runs the kernel there, and writes back — correct for the bridge and for a bare ForEach / Iternext() loop alike.
Struct-generic is a template substitute. Constraining a type parameter to struct lets the JIT specialize the method per concrete type at codegen time. For hot inner loops this is indistinguishable from a hand-inlined function. Use it — but remember that scalar kernel code only autovectorizes after tier-1 JIT promotion, which takes ~100+ hot-loop iterations. Microbenchmarks that warm up 10 times will wildly under-report Layer 1/2 performance. Production code never sees this effect.
Simple kernels autovectorize after warmup. Post-tier-1, the JIT autovectorizes both byte-pointer *(float*)(p + i*s) = ... and typed-pointer dst[i] = ... loops into Vector256. If you care about every microsecond, a stride-equality branch with typed pointers in the fast path is slightly more robust and reaches tier-1 faster, but it's not the order-of-magnitude difference you might expect — the Vector256 + 4×-unroll hand-kernel is.
Everything else — flag enums, op_axes encoding, negative-stride flipping, the double-loop reduction schedule — exists to handle corner cases NumPy users write every day without thinking. NumSharp handles them the same way, just translated into a language where we emit IL instead of expanding templates.
See Also
- IL Generation — the kernel side of the bridge
- Broadcasting — stride-0 iteration
- Buffering & Memory — buffer allocation and lifetime