Fusion — np.evaluate vs unfused chains (and NumPy context)

np.evaluate runs a whole expression tree in a single NDIter pass, with no intermediate arrays. This run is a fixed-expression gate plus an operand-layout sweep of the flagship a*b+c (C/F/T/strided/bcast), which asks whether the fused single-pass win survives non-contiguous operands. It is not a full dtype/layout matrix, so the numbers are reported as-is.
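
As a rough illustration of what "no intermediates" means, the sketch below contrasts an unfused a*b+c chain with a hand-written single-pass loop, in Python with NumPy. It is not NumSharp's np.evaluate or its NDIter: `unfused_mul_add` and `fused_mul_add` are hypothetical helpers, and the small array size is chosen only so the pure-Python loop finishes quickly.

```python
import numpy as np

def unfused_mul_add(a, b, c):
    # Two passes: a*b materialises a full-size temporary, then + c reads it back.
    tmp = a * b
    return tmp + c

def fused_mul_add(a, b, c):
    # Hypothetical single-pass equivalent: each element is read once,
    # combined, and written once, with no full-size temporary.
    out = np.empty_like(a)
    for i in range(a.size):
        out[i] = a[i] * b[i] + c[i]
    return out

rng = np.random.default_rng(0)
a, b, c = (rng.random(1000) for _ in range(3))

# Both routes should agree; only the memory traffic differs.
same = np.allclose(unfused_mul_add(a, b, c), fused_mul_add(a, b, c))
print("results match:", same)
```

The Python loop is far slower than either NumPy route; it only shows the access pattern that a fused kernel would use.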

NumSharp — fused np.evaluate vs unfused np.* chains (4M elements, best-of-9; (Nx) = unfused ÷ fused, >1 = fusion faster):

correctness cross-checks ok

4M float64, best of 9:
  a*b+c       fused    4.48 ms   unfused    6.97 ms   (1.56x)
  (a-b)/(a+b) fused    3.26 ms   unfused   13.54 ms   (4.16x)
  sum(a*b)    fused    2.44 ms   unfused    3.90 ms   (1.60x)
  sum(af*bf)  fused    1.30 ms   unfused    1.68 ms   (1.29x)  [f32]
  a*b+c out=  fused    3.77 ms   [1-pass fused-into-out]
  i4*2+f8     fused    2.93 ms   unfused    4.18 ms   (1.43x)

  a*b+c across operand layouts (2-D 2000x2000, all 3 operands same layout):
    [C      ] fused    3.68 ms   unfused    6.43 ms   (1.75x)
    [F      ] fused    3.60 ms   unfused    6.67 ms   (1.85x)
    [T      ] fused    3.67 ms   unfused    6.37 ms   (1.74x)
    [strided] fused    3.49 ms   unfused    4.75 ms   (1.36x)
    [bcast  ] fused    1.11 ms   unfused    3.99 ms   (3.60x)
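
The layout labels can be approximated in NumPy as shown below. The exact constructions used by the benchmark are not stated here, so every entry is an assumption: in particular, "strided" is taken to be every other column of a wider buffer, and "bcast" a single row broadcast to 2000x2000.

```python
import numpy as np

n = 2000
rng = np.random.default_rng(0)
base = rng.random((n, n))

layouts = {
    "C": np.ascontiguousarray(base),            # row-major, contiguous
    "F": np.asfortranarray(base),               # column-major, contiguous
    "T": np.ascontiguousarray(base).T,          # transposed view of a C array
    "strided": rng.random((n, 2 * n))[:, ::2],  # assumed: every other column
    "bcast": np.broadcast_to(base[0], (n, n)),  # assumed: one row, stride 0 on axis 0
}

# Print shape, byte strides, and contiguity flags for each layout.
for name, arr in layouts.items():
    print(f"[{name:<7}]", arr.shape, arr.strides,
          "C" if arr.flags["C_CONTIGUOUS"] else "-",
          "F" if arr.flags["F_CONTIGUOUS"] else "-")
```

Only the C and F entries are contiguous in their own order; T is F-contiguous by construction, while the strided and bcast entries are contiguous in neither order.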

NumPy — absolutes on the same box (context for the unfused column):

numpy 2.4.2, 4M float64, best of 9:
  a*b+c         12.93 ms
  (a-b)/(a+b)   19.64 ms
  sum(a*b)       8.45 ms
  sum(af*bf)     4.19 ms  [f32]
  a*b+c out=     4.96 ms  [two-pass with out=]
  i4*2+f8        9.99 ms

  a*b+c across operand layouts (2-D 2000x2000, unfused):
    [C      ]   12.87 ms
    [F      ]   12.76 ms
    [T      ]   12.84 ms
    [strided]    7.87 ms
    [bcast  ]   12.36 ms
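
For reference, the unfused chain and the "two-pass with out=" variant look roughly like this in NumPy, timed best-of-9. This is a sketch, not the harness that produced the numbers above: `best_of` is a hypothetical helper, and timings will differ by machine.

```python
import time
import numpy as np

def best_of(fn, repeats=9):
    # Hypothetical helper: minimum wall-clock time over `repeats` runs, in ms.
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best * 1e3

n = 4_000_000
rng = np.random.default_rng(0)
a, b, c = (rng.random(n) for _ in range(3))
out = np.empty_like(a)

def unfused():
    # a*b materialises a full-size temporary before + c runs.
    return a * b + c

def two_pass_out():
    # Still two passes over memory, but no allocation:
    # multiply into `out`, then add c in place.
    np.multiply(a, b, out=out)
    np.add(out, c, out=out)
    return out

t_unfused = best_of(unfused)
t_out = best_of(two_pass_out)
print(f"a*b+c       {t_unfused:6.2f} ms")
print(f"a*b+c out=  {t_out:6.2f} ms")
```

The out= variant avoids allocating a result on each call, but it still traverses the data twice, which is the cost a fused single pass removes.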