Benchmarks

NumSharp performance lab

Benchmark Dashboard

A compact operating view of the NumSharp vs NumPy benchmark suite: 14 official op suites, all supported dtypes where applicable, three cache tiers, and five subsystem scans for iterator, layout, operand, cast, and fusion behavior.

Snapshot: 2026-06-23 Commit: e3b7c268 NumPy: 2.4.2 Ratio: NumPy_ms / NumSharp_ms Higher is better

Operation cells

1,851

op x dtype x size rows in the official matrix

Large arrays

1.26x

10M-element geomean, 80% of NumPy time

Cache tier

0.90x

100K-element geomean, current main pressure point

Cast subsystem

1,439

wins out of 1,568 comparable cast cells

Operation cells

Legend & How To Read

Benchmark row

Each result is one operation x dtype x size timing cell from the latest benchmark snapshot; rollups use comparable measured rows.

Timing fields

Ratios come from raw NumPy_ms and NumSharp_ms timings. A value above 1.00x is a NumSharp win; below 1.00x means NumPy was faster.

Drill-down

Click Status Mix segments, Suite Scoreboard rows, and Dtype Heatmap cards to open the internal top-25 best and worst benchmark rows for that grouping.

Cell

One benchmark row: operation, dtype, and size tier. The dashboard uses best timed runs after warmup.

Reading Ratios

Ratio is NumPy / NumSharp. Higher is better: 3.00x means NumSharp took 10s while NumPy took 30s; 0.50x means NumSharp takes about twice as long.

Performance Bands

Faster: 1.00x and above Close: 0.50x to 1.00x Slower: 0.20x to 0.50x Much slower: below 0.20x No data: pending C# measurement

Status Mix

All measured rows classified by NumPy / NumSharp, including sub-microsecond rows; only no-data cells are separate

4 x100+ faster 925 faster, 1.00-100x 434 close, 0.50-1.00x 291 slower, 0.20-0.50x 128 much slower, <0.20x 69 no data

Suite Scoreboard

Geomean across credible rows. Parity marker is 1.0x.

Statistics 2.24x 39 / 10

Reduction 1.81x 385 / 110

Broadcasting 1.10x 3 / 0

Sorting 1.08x 23 / 13

Creation 1.04x 23 / 21

Unary 0.83x 93 / 120

Arithmetic 0.80x 145 / 197

Selection 0.76x 2 / 4

Logic 0.68x 14 / 25

Comparison 0.64x 17 / 31

Linear algebra 0.57x 2 / 6

Bitwise 0.54x 45 / 68

Manipulation 0.40x 1 / 1

Dtype Heatmap

Credible operation-matrix rows by dtype and cache tier

uint82.00x61 rows

int81.84x63 rows

uint161.44x65 rows

int161.39x66 rows

float161.28x224 rows

uint321.09x65 rows

complex1281.04x64 rows

float321.01x233 rows

float640.95x257 rows

int320.93x111 rows

int640.75x118 rows

uint640.74x63 rows

bool0.30x8 rows

Subsystem Signals

Result models that the op matrix cannot express

NDIter

1.18x

Iterator operation geomean. Strong reductions and dtype loops, with copy/cast and index math still visible as overhead canaries.

Layout

0.50-1.80x

Layout scans expose the real split: large elementwise wins, while strided/broadcast reductions and decimal sums trail.

Cast

1,439 wins

The full src-to-dst astype grid is broadly ahead. Remaining lag clusters around same-type diagonal copy and bool conversion cases.

Fusion

4.16x

The best fixed expression speedup for fused np.evaluate over NumSharp's unfused chain; broadcast fusion reaches 3.60x.

Function Explorer

Search every named np.* API in the latest matrix, then inspect dtype, size, scenario, and raw timing rows

Loading np.* function performance surface...

Optimization Priorities

Current optimization priorities from the latest snapshot

Shift kernels: vectorize 100K int left/right shift.
Bool bitwise: lift invert, &, |, ^.
Decimal reduce: fix broadcast/sliced sum axis cliffs.
i32 broadcast: repair stride-0 axis sum.
f64 sum: close the 100K reduction gap.
Float predicates: accelerate 100K isnan/isinf/isfinite.
f32 add/mul: improve 100K scalar/literal paths.
f64 add/mul: improve 100K scalar/literal paths.
Float abs/neg: remove 100K C/strided cliffs.
Rounding: tighten 100K f32/f64 floor/ceil/trunc.
Exp/log: reduce f32/f64 mid-tier overhead.
f16 unary: lift sign/math scalar fallback.
Int mean: improve 10M int64/uint64 mean.
Linear algebra: revisit large f64 matmul/dot.
Flatten: fix the 100K copy/cast trough.
Astype small: reduce scalar/1K setup cost.
Ravel T: close transpose ravel mid-size gap.
less->bool: specialize bool output loops.
Index math: speed scalar/1K unravel/ravel-multi.
NDIter copy: raise 100K copy/cast geomean.
Chunk width: optimize tiny inner width dispatch.
f16 operands: improve strided/reversed/broadcast cases.
Cast diagonal: speed same-type copy cells.
Cast bool: clean remaining * -> bool losses.
Fusion: broaden fused-expression coverage.

Full Reports

Detailed tables remain available for traceability

Snapshot manifest Unified report Cast matrix NDIter results Layout matrix Operand layouts Fusion results IL generation

Table of Contents