Benchmarks

NumSharp performance lab

Benchmark Dashboard

A compact operating view of the NumSharp vs NumPy benchmark suite: 14 official op suites, all supported dtypes where applicable, three cache tiers, and five subsystem scans for iterator, layout, operand, cast, and fusion behavior.

Snapshot: 2026-06-23 | Commit: e3b7c268 | NumPy: 2.4.2 | Ratio: NumPy_ms / NumSharp_ms (higher is better)
Operation cells
1,851

op x dtype x size rows in the official matrix

Large arrays
1.26x

10M-element geomean; NumSharp takes roughly 80% of NumPy's time

Cache tier
0.90x

100K-element geomean, current main pressure point

Cast subsystem
1,439

wins out of 1,568 comparable cast cells

Operation cells

Legend & How To Read

Ratio: NumPy / NumSharp. Example: 3.00x means NumSharp took 10 s where NumPy took 30 s.
Benchmark row
Each result is one operation x dtype x size timing cell from the latest benchmark snapshot; rollups use comparable measured rows.
Timing fields
Ratios come from raw NumPy_ms and NumSharp_ms timings. A value above 1.00x is a NumSharp win; below 1.00x means NumPy was faster.
Drill-down
Click Status Mix segments, Suite Scoreboard rows, and Dtype Heatmap cards to open the internal top-25 best and worst benchmark rows for that grouping.

Cell

One benchmark row: operation, dtype, and size tier. The dashboard uses best timed runs after warmup.

Reading Ratios

Higher is better. 2.00x means NumSharp is twice as fast as NumPy; 0.50x means NumSharp takes about twice as long.

Performance Bands

Faster: 1.00x and above
Close: 0.50x to 1.00x
Slower: 0.20x to 0.50x
Much slower: below 0.20x
No data: pending C# measurement
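As a rough illustration of how a single cell could be scored, the sketch below computes the ratio from two raw timings and maps it to the bands listed above. The function names are hypothetical and not part of the benchmark tooling. The source does not say which band owns an exact boundary value, so the code assumes the lower bound is inclusive (0.50x counts as Close, 0.20x as Slower).

```python
def ratio(numpy_ms: float, numsharp_ms: float) -> float:
    """Speed ratio as defined on the dashboard: NumPy_ms / NumSharp_ms."""
    if numpy_ms <= 0 or numsharp_ms <= 0:
        raise ValueError("timings must be positive")
    return numpy_ms / numsharp_ms


def band(r: float) -> str:
    """Map a ratio to a performance band.

    Assumption: lower bounds are inclusive; the dashboard does not
    state how exact boundary values are assigned.
    """
    if r >= 1.00:
        return "Faster"
    if r >= 0.50:
        return "Close"
    if r >= 0.20:
        return "Slower"
    return "Much slower"


def time_share(r: float) -> float:
    """Fraction of NumPy's time that NumSharp needs at ratio r."""
    return 1.0 / r


if __name__ == "__main__":
    r = ratio(numpy_ms=30_000.0, numsharp_ms=10_000.0)  # the 30 s vs 10 s example
    print(r, band(r))  # 3.0 Faster
```

Read in the other direction, `time_share(1.26)` is about 0.79, which matches the "roughly 80% of NumPy's time" note on the large-array card.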

Status Mix

All measured rows are classified by the NumPy / NumSharp ratio, including sub-microsecond rows; only no-data cells are counted separately.

100x+ faster: 4
Faster, 1.00-100x: 925
Close, 0.50-1.00x: 434
Slower, 0.20-0.50x: 291
Much slower, <0.20x: 128
No data: 69
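The status counts can be cross-checked against the headline cell count with a few lines of arithmetic. This is only a sanity-check sketch; the dictionary keys are labels chosen for illustration, and the counts are copied from the status mix above.

```python
# Row counts per status, taken from the Status Mix figures.
status_mix = {
    "100x+ faster": 4,
    "faster": 925,
    "close": 434,
    "slower": 291,
    "much slower": 128,
    "no data": 69,
}

total = sum(status_mix.values())            # all operation cells
measured = total - status_mix["no data"]    # cells with both timings
wins = status_mix["100x+ faster"] + status_mix["faster"]

print(f"total cells:    {total}")
print(f"measured cells: {measured}")
print(f"NumSharp wins:  {wins} ({wins / measured:.1%} of measured)")
```

The total comes to 1,851, the same figure as the "Operation cells" card, and NumSharp wins a little over half of the measured rows.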

Suite Scoreboard

Geomean across credible rows. Parity marker is 1.0x.

Statistics 2.24x 39 / 10
Reduction 1.81x 385 / 110
Broadcasting 1.10x 3 / 0
Sorting 1.08x 23 / 13
Creation 1.04x 23 / 21
Unary 0.83x 93 / 120
Arithmetic 0.80x 145 / 197
Selection 0.76x 2 / 4
Logic 0.68x 14 / 25
Comparison 0.64x 17 / 31
Linear algebra 0.57x 2 / 6
Bitwise 0.54x 45 / 68
Manipulation 0.40x 1 / 1
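The suite figures above are geometric means of per-row ratios. A minimal sketch of that rollup follows, assuming a plain list of ratios as input. How the dashboard decides which rows are "credible" is not specified here, so no such filter is modelled and the helper name is hypothetical.

```python
import math
from typing import Iterable


def geomean(ratios: Iterable[float]) -> float:
    """Geometric mean of positive ratios, computed in log space."""
    vals = list(ratios)
    if not vals:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in vals):
        raise ValueError("ratios must be positive")
    return math.exp(sum(math.log(r) for r in vals) / len(vals))


if __name__ == "__main__":
    # A 2x win and a 2x loss cancel out to parity under the geomean,
    # whereas the arithmetic mean of the same rows would be 1.25.
    print(round(geomean([2.0, 0.5]), 6))  # 1.0
```

The geometric mean is the usual choice for ratios because it treats a 2.00x win and a 0.50x loss symmetrically, which is why the parity marker sits at 1.0x.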

Dtype Heatmap

Credible operation-matrix rows by dtype and cache tier

Subsystem Signals

Result models that the op matrix cannot express

NDIter

1.18x

Iterator operation geomean. Strong reductions and dtype loops, with copy/cast and index math still visible as overhead canaries.

Layout

0.50-1.80x

Layout scans expose the real split: large elementwise operations win, while strided/broadcast reductions and decimal sums trail.

Cast

1,439 wins

The full src-to-dst astype grid is broadly ahead. Remaining lag clusters around same-type diagonal copy and bool conversion cases.

Fusion

4.16x

Best fixed-expression speedup of fused np.evaluate over NumSharp's own unfused chain; broadcast fusion reaches 3.60x.

Function Explorer

Search every named np.* API in the latest matrix, then inspect dtype, size, scenario, and raw timing rows


Optimization Priorities

Current optimization priorities from the latest snapshot

  1. Shift kernels: vectorize 100K int left/right shift.
  2. Bool bitwise: lift invert, &, |, ^.
  3. Decimal reduce: fix broadcast/sliced sum axis cliffs.
  4. i32 broadcast: repair stride-0 axis sum.
  5. f64 sum: close the 100K reduction gap.
  6. Float predicates: accelerate 100K isnan/isinf/isfinite.
  7. f32 add/mul: improve 100K scalar/literal paths.
  8. f64 add/mul: improve 100K scalar/literal paths.
  9. Float abs/neg: remove 100K C/strided cliffs.
  10. Rounding: tighten 100K f32/f64 floor/ceil/trunc.
  11. Exp/log: reduce f32/f64 mid-tier overhead.
  12. f16 unary: lift sign/math scalar fallback.
  13. Int mean: improve 10M int64/uint64 mean.
  14. Linear algebra: revisit large f64 matmul/dot.
  15. Flatten: fix the 100K copy/cast trough.
  16. Astype small: reduce scalar/1K setup cost.
  17. Ravel T: close transpose ravel mid-size gap.
  18. less->bool: specialize bool output loops.
  19. Index math: speed scalar/1K unravel/ravel-multi.
  20. NDIter copy: raise 100K copy/cast geomean.
  21. Chunk width: optimize tiny inner width dispatch.
  22. f16 operands: improve strided/reversed/broadcast cases.
  23. Cast diagonal: speed same-type copy cells.
  24. Cast bool: clean remaining * -> bool losses.
  25. Fusion: broaden fused-expression coverage.

Full Reports

Detailed tables remain available for traceability