Class SizeBucketedBufferPool
Thread-safe pool of recently-freed unmanaged buffers, bucketed by exact byte size. Acts as a tcache-like front for NativeMemory.Alloc(nuint) / NativeMemory.Free(void*): a successful Take is just a pop from a per-size ConcurrentStack<T>; a failed Take falls through to NativeMemory.Alloc.
WHY THIS EXISTS
Profiling NumSharp's binary-op pipeline shows ~500 µs of every
1024×1024 float32 a + b is spent on first-touch overhead of the
fresh output buffer — page-faulting each cache line on first write
plus the kernel-mode cost of Alloc(nuint)
reaching out to the OS for a fresh chunk. NumPy hides the same
cost via glibc tcache reuse: a buffer freed by the previous op is
handed back warm to the next call. This pool replicates that
behaviour at the NumSharp layer.
SIZING POLICY
• The window is MinPoolableBytes (1 B) to MaxPoolableBytes (64 MiB). Wave 2.4 opened both ends: the 1000-element float32 result (4000 B) missed the old 4 KiB floor by 96 bytes, and every 4M-element output (16–32 MiB) missed the old 1 MiB cap, paying ~2× in demand-zero page faults per call (in-place toggle-verified: P1 contig add 4M 3.37→1.74 ms).
• Above the cap: no pooling. Huge buffers are rare and the memory cost of keeping them around dwarfs the alloc-cost savings.
• Per-bucket cap of MaxBuffersPerBucket entries (MaxBuffersPerLargeBucket at ≥ 1 MiB) to bound peak resident memory.
• Bucket key is the EXACT byte count requested (no rounding). Same-size repeated allocs are the dominant pattern in element-wise ops; rounding to power-of-2 would waste memory and break exact-fit reuse for typical workloads (e.g. 4 MiB float32 1K×1K).
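The sizing rules above can be restated as a small decision function. This is a minimal sketch, not the library's code: the constant values are copied from the Fields section of this page, while the class name SizingPolicySketch and the methods IsPoolable, BucketCapacity and MaxResidentBytes are hypothetical names introduced for illustration. It also assumes the upper bound is inclusive, since the text only says sizes above the cap are not pooled.

```csharp
using System;

// Hypothetical helper that restates the documented sizing policy.
// Constant values mirror the Fields section; the method names are illustrative only.
public static class SizingPolicySketch
{
    public const long MinPoolableBytes = 1;
    public const long MaxPoolableBytes = 64L * 1024 * 1024;     // 67108864
    public const long LargeBucketThreshold = 1L * 1024 * 1024;  // 1048576
    public const int MaxBuffersPerBucket = 8;
    public const int MaxBuffersPerLargeBucket = 2;

    // True when a request of this exact size falls inside the pooling window
    // (upper bound assumed inclusive).
    public static bool IsPoolable(long bytes) =>
        bytes >= MinPoolableBytes && bytes <= MaxPoolableBytes;

    // Maximum number of buffers retained for this exact-size bucket; 0 means "never pooled".
    public static int BucketCapacity(long bytes)
    {
        if (!IsPoolable(bytes)) return 0;
        return bytes >= LargeBucketThreshold ? MaxBuffersPerLargeBucket : MaxBuffersPerBucket;
    }

    // Upper bound on the bytes a single bucket can keep resident.
    public static long MaxResidentBytes(long bytes) => BucketCapacity(bytes) * bytes;
}
```

Under these assumptions a 4000-byte bucket may hold up to 8 buffers (32000 bytes resident), while a 16 MiB bucket holds at most 2 (32 MiB resident).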
CORRECTNESS
• Stored buffers are NOT zero-filled. Callers that need zeroed memory must zero on Take (the same contract NativeMemory.Alloc has).
• Buffer ownership transfers fully on Take: the pool no longer references the pointer, so subsequent Return calls aren't at risk of double-pop.
• Return is best-effort: when the bucket is full or the size falls outside the pool's window the pointer is freed immediately via Free(void*).
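The contract above can be illustrated with a toy exact-size pool. This is a sketch under stated assumptions and NOT the NumSharp implementation: Marshal.AllocHGlobal and Marshal.FreeHGlobal stand in for NativeMemory.Alloc and NativeMemory.Free so that the example compiles without unsafe code, the bucket capacity of 2 is arbitrary, the size-window check is omitted, and the names ToyBufferPool and Bucket are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;
using System.Threading;

// Toy model of an exact-size bucketed pool (illustrative only).
public static class ToyBufferPool
{
    private const int BucketCapacity = 2; // arbitrary cap for the sketch

    private sealed class Bucket
    {
        public readonly ConcurrentStack<nint> Stack = new();
        public int Count; // reserved slots, updated with Interlocked
    }

    private static readonly ConcurrentDictionary<long, Bucket> Buckets = new();

    // Pop a warm buffer of exactly this size, or fall through to a fresh allocation.
    public static nint Take(long bytes)
    {
        if (Buckets.TryGetValue(bytes, out var bucket) && bucket.Stack.TryPop(out nint ptr))
        {
            Interlocked.Decrement(ref bucket.Count);
            return ptr; // contents are whatever the previous owner left behind
        }
        return Marshal.AllocHGlobal((nint)bytes); // miss: stand-in for NativeMemory.Alloc
    }

    // Best-effort return: keep the buffer if the bucket has room, otherwise free it.
    public static void Return(nint ptr, long bytes)
    {
        var bucket = Buckets.GetOrAdd(bytes, _ => new Bucket());
        if (Interlocked.Increment(ref bucket.Count) <= BucketCapacity)
        {
            bucket.Stack.Push(ptr);
            return;
        }
        Interlocked.Decrement(ref bucket.Count); // bucket full: undo the reservation
        Marshal.FreeHGlobal(ptr);                // stand-in for NativeMemory.Free
    }
}
```

In this toy, writing a byte into a buffer, returning it, and taking the same size again on a single thread hands back the same pointer with the stale byte still in place, which is why callers that need zeroed memory must clear the buffer themselves after Take.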
public static class SizeBucketedBufferPool
Inheritance
SizeBucketedBufferPool
Fields
GuardPagesEnabled
Opt-in diagnostic page-heap mode (env NUMSHARP_GUARD_PAGES=1, Windows only).
When on, every Take(long)/TakeZeroed(long) hands back a buffer whose
last byte abuts an inaccessible guard page (AllocGuarded(long, out nint)),
pooling is bypassed, and any one-past-the-end write faults INSTANTLY at the offending
access — used to localise an indexing out-of-bounds write to the exact case/site.
Default OFF (the field is read once at startup) so production paths are untouched.
public static readonly bool GuardPagesEnabled
Field Value
bool
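A read-once, opt-in flag of this kind can be sketched as follows. Only the variable name NUMSHARP_GUARD_PAGES and the Windows-only restriction come from the description above; the exact parsing rule (the literal string "1"), the class name GuardFlagSketch and the helper ReadFlag are assumptions made for illustration.

```csharp
#nullable enable
using System;

// Hypothetical sketch of a diagnostic flag that is evaluated once at startup.
public static class GuardFlagSketch
{
    // Set by the static initializer, so later changes to the environment are ignored.
    public static readonly bool GuardPagesEnabled = ReadFlag(
        Environment.GetEnvironmentVariable("NUMSHARP_GUARD_PAGES"),
        OperatingSystem.IsWindows());

    // Pure helper so the decision can be checked without touching the real environment.
    // Assumption: only the exact value "1" enables the mode.
    public static bool ReadFlag(string? value, bool isWindows) =>
        isWindows && value == "1";
}
```

Because the field is static readonly, hot paths can branch on it without re-reading the environment on every call.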
LargeBucketThreshold
Bucket sizes at/above this hold at most MaxBuffersPerLargeBucket buffers.
public const long LargeBucketThreshold = 1048576
Field Value
long
MaxBuffersPerBucket
Maximum number of buffers kept per exact-size bucket (below LargeBucketThreshold).
public const int MaxBuffersPerBucket = 8
Field Value
int
MaxBuffersPerLargeBucket
Per-bucket cap for large (≥ 1 MiB) buckets — bounds peak resident memory.
public const int MaxBuffersPerLargeBucket = 2
Field Value
int
MaxPoolableBytes
Maximum allocation size to pool (bytes). Wave 2.4 raised this from 1 MiB to 64 MiB: the dominant benchmark/e2e shapes (4M elements = 16 MiB float32 / 32 MiB float64 outputs) all missed the old cap and paid ~0.3–0.4 ms of first-touch page faults per call — the "allocator tax" residual on every measured e2e strided row. NumPy gets the same reuse for free from glibc's arena caching. Resident growth is bounded by the per-bucket cap, which drops to MaxBuffersPerLargeBucket at LargeBucketThreshold (realistic workloads keep one or two hot output shapes — exactly the tcache pattern).
public const long MaxPoolableBytes = 67108864
Field Value
long
MinPoolableBytes
Minimum allocation size to pool (bytes). Wave 2.4 lowered this from 4096 to 1: the small-N hot path (e.g. a 1000-element float32 ufunc result = 4000 bytes) sat just under the old threshold and paid a fresh NativeMemory.Alloc + GC memory pressure pair on EVERY call. Tiny buckets cost almost nothing resident (8 × size) and a pool hit skips the pressure churn entirely (see the pool-owned pressure accounting below).
public const long MinPoolableBytes = 1
Field Value
long
Properties
Hits
How many Take calls served from the pool.
public static long Hits { get; }
Property Value
long
Misses
How many Take calls fell through to NativeMemory.Alloc.
public static long Misses { get; }
Property Value
long
Returns
How many Return calls accepted the buffer into the pool.
public static long Returns { get; }
Property Value
long
ReturnsFreed
How many Return calls freed the buffer (bucket full / out-of-range).
public static long ReturnsFreed { get; }
Property Value
long
ZeroedAllocs
How many TakeZeroed calls went straight to calloc (the np.zeros fast path).
public static long ZeroedAllocs { get; }
Property Value
long
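The counters above lend themselves to two derived ratios. The following is a minimal sketch; the class PoolStatsSketch and its methods are hypothetical and not part of the library. It assumes that Hits + Misses covers every Take call and that Returns + ReturnsFreed covers every Return call, which is how the property descriptions read.

```csharp
using System;

// Hypothetical helpers for interpreting the pool's diagnostic counters.
public static class PoolStatsSketch
{
    // Fraction of Take calls served from the pool; 0 when nothing has been recorded.
    public static double HitRate(long hits, long misses)
    {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double)hits / total;
    }

    // Fraction of Return calls whose buffer was kept rather than freed.
    public static double RetentionRate(long returns, long returnsFreed)
    {
        long total = returns + returnsFreed;
        return total == 0 ? 0.0 : (double)returns / total;
    }
}
```

A caller would pass the live values, for example PoolStatsSketch.HitRate(SizeBucketedBufferPool.Hits, SizeBucketedBufferPool.Misses), typically after ResetCounters() and a measured workload.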
Methods
Clear()
Drain every pooled buffer immediately (testing / memory pressure). Calls Free(void*) on each. No pressure adjustment — pooled buffers carry none (live-state accounting).
public static void Clear()
ResetCounters()
Reset all counters. Diagnostic only.
public static void ResetCounters()
Return(nint, long)
Return a buffer to the pool. Caller transfers ownership; do NOT touch the pointer after the call.
If the size falls outside the pool window or the bucket is already at capacity, the buffer is freed via Free(void*) instead of being kept.
public static void Return(nint ptr, long bytes)
Parameters
ptr nint
Pointer obtained from Take(long) or a paired NativeMemory.Alloc.
bytes long
Size in bytes originally requested.
Take(long)
Take a buffer of the given byte size. Returns either a reused warm buffer or a fresh allocation; either way the caller owns it and must eventually Return or Free it. The memory is NOT zeroed.
public static nint Take(long bytes)
Parameters
bytes long
Byte size of the buffer. Must be > 0.
Returns
nint
TakeZeroed(long)
Take a ZERO-INITIALIZED buffer of the given byte size. This is
the np.zeros / fill-with-default fast path and the analogue of
NumPy's npy_alloc_cache_zero / PyDataMem_NEW_ZEROED.
WHY calloc INSTEAD OF Take + memset
NativeMemory.AllocZeroed(nuint) calls the CRT
calloc. For any non-trivial size the CRT/OS serves the
request from fresh, copy-on-write zero pages (Windows
VirtualAlloc / Linux mmap(MAP_ANONYMOUS)): the
pages are only physically committed and zeroed by the kernel
lazily, on first write. So zeroing a 10M-element (80 MB) block
costs ~0.01 ms, whereas an explicit element fill (~14 ms) or a
NativeMemory.Clear(void*, nuint) memset (~21 ms) pays to touch
every page up front.
The dirty same-size bucket cache used by Take(long) is intentionally NOT consulted: a recycled buffer is dirty and would force a full memset, touching every page and throwing away the entire lazy-zero advantage for exactly the large sizes that matter. NumPy makes the same call: its zero cache only engages below 1 KiB, a regime where the CRT's own low-fragmentation heap already supplies that small-block reuse for our calloc.
OWNERSHIP & PRESSURE
The caller owns the returned pointer and must eventually Return(nint, long) or free it; a calloc'd pointer is a normal NativeMemory allocation, so Return may pool it for later (non-zero) reuse by Take(long) or free it via Free(void*). GC memory pressure is registered here exactly as Take(long) does, so the paired Return(nint, long) balances it (live-state accounting).
public static nint TakeZeroed(long bytes)
Parameters
bytes long
Byte size of the buffer. Must be >= 0.
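The two zeroing strategies contrasted in this section can be placed side by side. This is a sketch, not the pool's code: the class ZeroedAllocSketch and its method names are hypothetical, while NativeMemory.AllocZeroed, NativeMemory.Alloc, NativeMemory.Clear and NativeMemory.Free are the real .NET APIs (Clear assumes .NET 7 or later). The project must be compiled with AllowUnsafeBlocks enabled.

```csharp
using System;
using System.Runtime.InteropServices;

// Illustrative comparison of calloc-style zeroing and allocate-then-memset.
public static unsafe class ZeroedAllocSketch
{
    // calloc-style path: the runtime/OS may hand back lazily zeroed pages.
    public static nint AllocLazyZeroed(long bytes) =>
        (nint)NativeMemory.AllocZeroed((nuint)bytes);

    // Eager path: allocate, then touch every byte with an explicit clear.
    public static nint AllocThenClear(long bytes)
    {
        void* p = NativeMemory.Alloc((nuint)bytes);
        NativeMemory.Clear(p, (nuint)bytes);
        return (nint)p;
    }

    // Reads every byte, which also forces the pages to be committed.
    public static bool IsAllZero(nint ptr, long bytes)
    {
        byte* p = (byte*)ptr;
        for (long i = 0; i < bytes; i++)
        {
            if (p[i] != 0) return false;
        }
        return true;
    }

    public static void Free(nint ptr) => NativeMemory.Free((void*)ptr);
}
```

Both paths yield an all-zero buffer; they differ only in when the pages are touched, so timing the two allocation calls before reading the memory is the way to observe the gap described above.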