
Class SizeBucketedBufferPool

Namespace
NumSharp.Backends.Unmanaged.Pooling
Assembly
NumSharp.dll

Thread-safe pool of recently-freed unmanaged buffers, bucketed by exact byte size. Acts as a tcache-like front for Alloc(nuint) / Free(void*): a successful Take is just a pop from a per-size ConcurrentStack<T>; a failed Take falls through to NativeMemory.Alloc.

WHY THIS EXISTS

Profiling NumSharp's binary-op pipeline shows that ~500 µs of every 1024×1024 float32 a + b goes to first-touch overhead on the fresh output buffer: a page fault on the first write to each page, plus the kernel-mode cost of Alloc(nuint) asking the OS for a fresh chunk. NumPy hides the same cost via glibc tcache reuse: a buffer freed by the previous op is handed back warm to the next call. This pool replicates that behaviour at the NumSharp layer.

SIZING POLICY

• The window is MinPoolableBytes (1 B) to MaxPoolableBytes (64 MiB). Wave 2.4 opened both ends: the 1000-element float32 result (4000 B) missed the old 4 KiB floor by 96 bytes, and every 4M-element output (16–32 MiB) missed the old 1 MiB cap, paying ~2× in demand-zero page faults per call (in-place toggle-verified: P1 contig add 4M 3.37→1.74 ms).
• Above the cap: no pooling. Huge buffers are rare and the memory cost of keeping them around dwarfs the alloc-cost savings.
• Per-bucket cap of MaxBuffersPerBucket entries (MaxBuffersPerLargeBucket at ≥ 1 MiB) to bound peak resident memory.
• Bucket key is the EXACT byte count requested (no rounding). Same-size repeated allocs are the dominant pattern in element-wise ops; rounding to power-of-2 would waste memory and break exact-fit reuse for typical workloads (e.g. 4 MiB float32 1K×1K).
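The sizing rules above can be restated as a few lines of code. This is a minimal sketch, not the NumSharp source: PoolSizingSketch, IsPoolable, BucketCap and MaxResidentBytes are hypothetical names, the constants are copied from the fields documented on this page, and treating the upper bound as inclusive is an assumption.

```csharp
using System;

// Hypothetical helper, not part of NumSharp: restates the sizing policy as code.
public static class PoolSizingSketch
{
    // Values copied from the constants documented on this page.
    public const long MinPoolableBytes = 1;
    public const long MaxPoolableBytes = 64L * 1024 * 1024;     // 67108864
    public const long LargeBucketThreshold = 1L * 1024 * 1024;  // 1048576
    public const int MaxBuffersPerBucket = 8;
    public const int MaxBuffersPerLargeBucket = 2;

    // True when a request of this size falls inside the pooling window.
    // Assumption: both ends of the window are inclusive.
    public static bool IsPoolable(long bytes) =>
        bytes >= MinPoolableBytes && bytes <= MaxPoolableBytes;

    // Per-bucket cap: the smaller cap applies at or above the large threshold.
    public static int BucketCap(long bytes) =>
        bytes >= LargeBucketThreshold ? MaxBuffersPerLargeBucket : MaxBuffersPerBucket;

    // Worst-case bytes kept resident by one completely full bucket.
    public static long MaxResidentBytes(long bytes) =>
        IsPoolable(bytes) ? bytes * BucketCap(bytes) : 0;
}
```

Under these assumptions a full 4000-byte bucket holds at most 32000 bytes, while a full 4 MiB bucket holds at most 8 MiB, which is how the large-bucket cap bounds resident memory.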

CORRECTNESS

• Stored buffers are NOT zero-filled. Callers that need zeroed memory must zero on Take (the same contract NativeMemory.Alloc has).
• Buffer ownership transfers fully on Take: the pool no longer references the pointer, so subsequent Return calls aren't at risk of double-pop.
• Return is best-effort: when the bucket is full or the size falls outside the pool's window the pointer is freed immediately via Free(void*).
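The pattern described in this summary (pop from a per-size stack on a hit, fall through to the native allocator on a miss, best-effort Return) can be illustrated with a miniature pool. This is a simplified sketch and not the NumSharp implementation: MiniBucketPool is a hypothetical name, it uses a single per-bucket cap, keeps no counters, and registers no GC memory pressure. It assumes .NET 6 or later and a project compiled with unsafe code enabled.

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;

// Hypothetical miniature of the pattern described above; not the NumSharp source.
public static unsafe class MiniBucketPool
{
    private const long MaxPoolableBytes = 64L * 1024 * 1024;
    private const int MaxPerBucket = 8; // single cap, for brevity

    // One stack of free pointers per exact byte size.
    private static readonly ConcurrentDictionary<long, ConcurrentStack<nint>> Buckets = new();

    public static nint Take(long bytes)
    {
        if (bytes <= 0) throw new ArgumentOutOfRangeException(nameof(bytes));
        // Hit: pop a warm buffer. The pool forgets the pointer; the caller owns it.
        if (Buckets.TryGetValue(bytes, out var stack) && stack.TryPop(out nint warm))
            return warm;
        // Miss: fall through to the native allocator. Contents are NOT zeroed.
        return (nint)NativeMemory.Alloc((nuint)bytes);
    }

    public static void Return(nint ptr, long bytes)
    {
        if (ptr == 0) return;
        if (bytes >= 1 && bytes <= MaxPoolableBytes)
        {
            var stack = Buckets.GetOrAdd(bytes, _ => new ConcurrentStack<nint>());
            // Best-effort cap: Count can be slightly stale under contention.
            if (stack.Count < MaxPerBucket)
            {
                stack.Push(ptr);
                return;
            }
        }
        // Outside the window or bucket full: free immediately.
        NativeMemory.Free((void*)ptr);
    }

    // Drain every pooled buffer and free it.
    public static void Clear()
    {
        foreach (var stack in Buckets.Values)
            while (stack.TryPop(out nint p))
                NativeMemory.Free((void*)p);
    }
}
```

Because the bucket key is the exact byte count, a Return of 4000 bytes followed by a Take of 4000 bytes hands back the same pointer, whereas a Take of 4096 bytes would miss.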

public static class SizeBucketedBufferPool
Inheritance
SizeBucketedBufferPool

Fields

GuardPagesEnabled

Opt-in diagnostic page-heap mode (env NUMSHARP_GUARD_PAGES=1, Windows only). When on, every Take(long)/TakeZeroed(long) hands back a buffer whose last byte abuts an inaccessible guard page (AllocGuarded(long, out nint)), pooling is bypassed, and any one-past-the-end write faults INSTANTLY at the offending access — used to localise an indexing out-of-bounds write to the exact case/site. Default OFF (the field is read once at startup) so production paths are untouched.

public static readonly bool GuardPagesEnabled

Field Value

bool
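To picture what "last byte abuts an inaccessible guard page" means, the offset arithmetic can be sketched as below. This only illustrates the layout idea and assumes the common page-heap scheme of whole data pages followed by one guard page; how AllocGuarded(long, out nint) actually lays out memory is not shown in this document, and GuardLayoutSketch and Layout are hypothetical names.

```csharp
using System;

// Hypothetical illustration of a page-heap layout; not the NumSharp source.
public static class GuardLayoutSketch
{
    // For a request of `bytes` with the given page size, returns:
    //   DataBytes   - size of the accessible region, rounded up to whole pages
    //   UserOffset  - where the user pointer sits inside that region
    //   GuardOffset - where the inaccessible guard page would begin
    public static (long DataBytes, long UserOffset, long GuardOffset) Layout(long bytes, long pageSize)
    {
        if (bytes <= 0) throw new ArgumentOutOfRangeException(nameof(bytes));
        if (pageSize <= 0) throw new ArgumentOutOfRangeException(nameof(pageSize));

        long dataBytes = (bytes + pageSize - 1) / pageSize * pageSize;
        // Push the buffer to the end of the data pages so that
        // userOffset + bytes == guardOffset: index `bytes` is the first guarded byte.
        long userOffset = dataBytes - bytes;
        return (dataBytes, userOffset, dataBytes);
    }
}
```

With 4096-byte pages, a 4000-byte request would start 96 bytes into its data page, so a write at index 4000 lands on the guard page. One side effect of this scheme is that the returned pointer is not necessarily page-aligned.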

LargeBucketThreshold

Bucket sizes at/above this hold at most MaxBuffersPerLargeBucket buffers.

public const long LargeBucketThreshold = 1048576

Field Value

long

MaxBuffersPerBucket

Maximum number of buffers kept per exact-size bucket (below LargeBucketThreshold).

public const int MaxBuffersPerBucket = 8

Field Value

int

MaxBuffersPerLargeBucket

Per-bucket cap for large (≥ 1 MiB) buckets — bounds peak resident memory.

public const int MaxBuffersPerLargeBucket = 2

Field Value

int

MaxPoolableBytes

Maximum allocation size to pool (bytes). Wave 2.4 raised this from 1 MiB to 64 MiB: the dominant benchmark/e2e shapes (4M elements = 16 MiB float32 / 32 MiB float64 outputs) all missed the old cap and paid ~0.3–0.4 ms of first-touch page faults per call — the "allocator tax" residual on every measured e2e strided row. NumPy gets the same reuse for free from glibc's arena caching. Resident growth is bounded by the per-bucket cap, which drops to MaxBuffersPerLargeBucket at LargeBucketThreshold (realistic workloads keep one or two hot output shapes — exactly the tcache pattern).

public const long MaxPoolableBytes = 67108864

Field Value

long

MinPoolableBytes

Minimum allocation size to pool (bytes). Wave 2.4 lowered this from 4096 to 1: the small-N hot path (e.g. a 1000-element float32 ufunc result = 4000 bytes) sat just under the old threshold and paid a fresh NativeMemory.Alloc + GC memory pressure pair on EVERY call. Tiny buckets cost almost nothing resident (8 × size) and a pool hit skips the pressure churn entirely (see the pool-owned pressure accounting below).

public const long MinPoolableBytes = 1

Field Value

long

Properties

Hits

How many Take calls served from the pool.

public static long Hits { get; }

Property Value

long

Misses

How many Take calls fell through to NativeMemory.Alloc.

public static long Misses { get; }

Property Value

long

Returns

How many Return calls accepted the buffer into the pool.

public static long Returns { get; }

Property Value

long

ReturnsFreed

How many Return calls freed the buffer (bucket full / out-of-range).

public static long ReturnsFreed { get; }

Property Value

long

ZeroedAllocs

How many TakeZeroed calls went straight to calloc (the np.zeros fast path).

public static long ZeroedAllocs { get; }

Property Value

long
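The counters above are most useful as ratios. A small sketch of how they might be combined, assuming the values are read from Hits, Misses, Returns and ReturnsFreed; PoolStatsSketch and its methods are hypothetical helpers, not part of NumSharp.

```csharp
using System;

// Hypothetical helper for interpreting the pool counters; not part of NumSharp.
public static class PoolStatsSketch
{
    // Fraction of Take calls served from the pool (0 when nothing was recorded).
    public static double HitRate(long hits, long misses)
    {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double)hits / total;
    }

    // Fraction of Return calls whose buffer was kept rather than freed.
    public static double RetentionRate(long returns, long returnsFreed)
    {
        long total = returns + returnsFreed;
        return total == 0 ? 0.0 : (double)returns / total;
    }
}
```

In a benchmark one might call ResetCounters() first, run the workload, and then evaluate PoolStatsSketch.HitRate(SizeBucketedBufferPool.Hits, SizeBucketedBufferPool.Misses). A low retention rate would suggest buckets filling up or sizes falling outside the window.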

Methods

Clear()

Drain every pooled buffer immediately (testing / memory pressure). Calls Free(void*) on each. No pressure adjustment — pooled buffers carry none (live-state accounting).

public static void Clear()

ResetCounters()

Reset all counters. Diagnostic only.

public static void ResetCounters()

Return(nint, long)

Return a buffer to the pool. Caller transfers ownership; do NOT touch the pointer after the call.

If the size falls outside the pool window or the bucket is already at capacity, the buffer is freed via Free(void*) instead of being kept.

public static void Return(nint ptr, long bytes)

Parameters

ptr nint

Pointer obtained from Take(long) or a paired NativeMemory.Alloc.

bytes long

Size in bytes originally requested.

Take(long)

Take a buffer of the given byte size. Returns either a reused warm buffer or a fresh allocation; either way the caller owns it and must eventually Return or Free it. The memory is NOT zeroed.

public static nint Take(long bytes)

Parameters

bytes long

Byte size of the buffer. Must be > 0.

Returns

nint
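A typical caller pairs Take with Return in a try/finally block and passes the same byte count to both. The sketch below shows that shape under stated assumptions: the private Take and Return methods are stand-ins backed directly by NativeMemory so the example compiles without NumSharp, and TakeUsageSketch and SumOfSquares are hypothetical names. In real code the stand-ins would be replaced by SizeBucketedBufferPool.Take(long) and SizeBucketedBufferPool.Return(nint, long). It requires unsafe code to be enabled.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical usage sketch; the Take/Return stand-ins are NOT the NumSharp pool.
public static unsafe class TakeUsageSketch
{
    // Stand-ins so the sketch is self-contained.
    private static nint Take(long bytes) => (nint)NativeMemory.Alloc((nuint)bytes);
    private static void Return(nint ptr, long bytes) => NativeMemory.Free((void*)ptr);

    // Fills a scratch buffer with i*i for i in [0, n) and sums it. Requires n > 0.
    public static float SumOfSquares(int n)
    {
        if (n <= 0) throw new ArgumentOutOfRangeException(nameof(n));
        long bytes = (long)n * sizeof(float);
        nint ptr = Take(bytes);
        try
        {
            var span = new Span<float>((void*)ptr, n);
            // The buffer is not zeroed: write every element before reading it.
            for (int i = 0; i < n; i++) span[i] = i * i;

            float sum = 0;
            foreach (float v in span) sum += v;
            return sum;
        }
        finally
        {
            // Ownership goes back to the pool; do not touch ptr after this line.
            Return(ptr, bytes);
        }
    }
}
```

Returning with the originally requested byte count matters because the bucket key is the exact size: a mismatched size would file the buffer under the wrong bucket.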

TakeZeroed(long)

Take a ZERO-INITIALIZED buffer of the given byte size. This is the np.zeros / fill-with-default fast path and the analogue of NumPy's npy_alloc_cache_zero / PyDataMem_NEW_ZEROED.

WHY calloc INSTEAD OF Take + memset

AllocZeroed(nuint) calls the CRT calloc. For any non-trivial size the CRT/OS serves the request from fresh, copy-on-write zero pages (Windows VirtualAlloc / Linux mmap(MAP_ANONYMOUS)): the pages are only physically committed and zeroed by the kernel lazily, on first write. So zeroing a 10M-element (80 MB) block costs ~0.01 ms instead of the ~14 ms an explicit element fill — or even a Clear(void*, nuint) memset (~21 ms) — pays to touch every page up front.

The dirty same-size bucket cache used by Take(long) is intentionally NOT consulted: a recycled buffer is dirty and would force a full memset, touching every page and throwing away the entire lazy-zero advantage for exactly the large sizes that matter. NumPy makes the same call: its zero cache only engages below 1 KiB, a regime where the CRT's own low-fragmentation heap already supplies that small-block reuse for our calloc.

OWNERSHIP & PRESSURE

The caller owns the returned pointer and must eventually Return(nint, long) or free it; a calloc'd pointer is a normal NativeMemory allocation, so Return may pool it for later (non-zero) reuse by Take(long) or free it via Free(void*). GC memory pressure is registered here exactly as Take(long) does, so the paired Return(nint, long) balances it (live-state accounting).

public static nint TakeZeroed(long bytes)

Parameters

bytes long

Byte size of the buffer. Must be >= 0.

Returns

nint
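The two ways of obtaining zeroed memory contrasted above can be sketched with the NativeMemory API directly. This is an illustration of the trade-off, not the body of TakeZeroed: ZeroedAllocSketch and its methods are hypothetical, GC memory pressure registration is omitted, and the sketch assumes .NET 7 or later (for NativeMemory.Clear) with unsafe code enabled.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical comparison of the two zeroing strategies; not the NumSharp source.
public static unsafe class ZeroedAllocSketch
{
    // Lazy path: calloc-style allocation. For large sizes the allocator can
    // hand back untouched zero pages that are only committed on first write.
    public static nint AllocLazyZero(long bytes) =>
        (nint)NativeMemory.AllocZeroed((nuint)bytes);

    // Eager path: allocate, then memset. Touches every page up front.
    public static nint AllocThenClear(long bytes)
    {
        void* p = NativeMemory.Alloc((nuint)bytes);
        NativeMemory.Clear(p, (nuint)bytes);
        return (nint)p;
    }

    // Reads every byte (and therefore faults in every page) to verify the contents.
    public static bool IsAllZero(nint ptr, long bytes)
    {
        byte* p = (byte*)ptr;
        for (long i = 0; i < bytes; i++)
            if (p[i] != 0) return false;
        return true;
    }

    public static void Free(nint ptr) => NativeMemory.Free((void*)ptr);
}
```

Both paths yield identical contents; the difference is when the pages get touched. Timing the two on a large block is the way to check the figures quoted above on a given machine, since the cost depends on the OS and allocator.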