Benchmarks
Benchmark Dashboard
A compact operating view of the NumSharp vs NumPy benchmark suite: 14 official op suites, all supported dtypes where applicable, three cache tiers, and five subsystem scans for iterator, layout, operand, cast, and fusion behavior.
op x dtype x size rows in the official matrix
10M-element geomean, 80% of NumPy time
100K-element geomean, current main pressure point
wins out of 1,568 comparable cast cells
Legend & How To Read
Cell
One benchmark row: operation, dtype, and size tier. The dashboard uses best timed runs after warmup.
Reading Ratios
Higher is better. 2.00x means NumSharp is twice as fast as NumPy; 0.50x means NumSharp takes about twice as long.
Performance Bands
Status Mix
All measured rows classified by NumPy / NumSharp, including sub-microsecond rows; only no-data cells are separate
Suite Scoreboard
Geomean across credible rows. Parity marker is 1.0x.
Dtype Heatmap
Credible operation-matrix rows by dtype and cache tier
Subsystem Signals
Result models that the op matrix cannot express
NDIter
1.18xIterator operation geomean. Strong reductions and dtype loops, with copy/cast and index math still visible as overhead canaries.
Layout
0.50-1.80xLayout scans expose the real split: large elementwise wins, while strided/broadcast reductions and decimal sums trail.
Cast
1,439 winsThe full src-to-dst astype grid is broadly ahead. Remaining lag clusters around same-type diagonal copy and bool conversion cases.
Fusion
4.16xThe best fixed expression speedup for fused np.evaluate over NumSharp's unfused chain; broadcast fusion reaches 3.60x.
Function Explorer
Search every named np.* API in the latest matrix, then inspect dtype, size, scenario, and raw timing rows
Optimization Priorities
Current optimization priorities from the latest snapshot
- Shift kernels: vectorize 100K int left/right shift.
- Bool bitwise: lift
invert,&,|,^. - Decimal reduce: fix broadcast/sliced sum axis cliffs.
- i32 broadcast: repair stride-0 axis sum.
- f64 sum: close the 100K reduction gap.
- Float predicates: accelerate 100K
isnan/isinf/isfinite. - f32 add/mul: improve 100K scalar/literal paths.
- f64 add/mul: improve 100K scalar/literal paths.
- Float abs/neg: remove 100K C/strided cliffs.
- Rounding: tighten 100K f32/f64 floor/ceil/trunc.
- Exp/log: reduce f32/f64 mid-tier overhead.
- f16 unary: lift sign/math scalar fallback.
- Int mean: improve 10M int64/uint64 mean.
- Linear algebra: revisit large f64 matmul/dot.
- Flatten: fix the 100K copy/cast trough.
- Astype small: reduce scalar/1K setup cost.
- Ravel T: close transpose ravel mid-size gap.
- less->bool: specialize bool output loops.
- Index math: speed scalar/1K unravel/ravel-multi.
- NDIter copy: raise 100K copy/cast geomean.
- Chunk width: optimize tiny inner width dispatch.
- f16 operands: improve strided/reversed/broadcast cases.
- Cast diagonal: speed same-type copy cells.
- Cast bool: clean remaining
* -> boollosses. - Fusion: broaden fused-expression coverage.
Full Reports
Detailed tables remain available for traceability