Class SimdMatMul

Namespace
NumSharp.Backends.Kernels
Assembly
NumSharp.dll

High-performance SIMD matrix multiplication with cache blocking and panel packing. Single-threaded implementation achieving ~20 GFLOPS on modern CPUs.

Key optimizations:

  • GEBP (General Block Panel) algorithm with cache blocking
  • Full panel packing: A as [kc][MR] panels, B as [kc][NR] panels
  • 8x16 micro-kernel with 16 Vector256 accumulators
  • FMA (Fused Multiply-Add): one instruction performs a multiply and an add, giving 2x the FLOP throughput of separate multiply and add instructions
  • 4x k-loop unrolling for instruction-level parallelism
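
To make the accumulator arithmetic concrete, here is a minimal sketch of an 8x16 float micro-kernel over packed panels. It is an illustration written for this page, not the library's kernel: the names `MicroKernelSketch`, `Kernel8x16` and `Demo` are hypothetical, the choice to accumulate into C is an assumption, and the 4x k-loop unrolling is left out for brevity. It shows why 16 accumulators are needed: 8 rows times two Vector256<float> (8 floats each) cover one 8x16 tile.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static unsafe class MicroKernelSketch
{
    const int MR = 8;   // rows per tile
    const int NR = 16;  // columns per tile (two Vector256<float>)

    // packedA: layout [kc][MR]; packedB: layout [kc][NR].
    // c: MR x NR tile with row stride ldc. Computes c += A * B (assumed convention).
    public static void Kernel8x16(float* packedA, float* packedB, float* c, long ldc, int kc)
    {
        // 16 accumulators: 2 vectors per row, 8 rows.
        // A tuned kernel would keep these in named locals so they stay in registers.
        Vector256<float>* acc = stackalloc Vector256<float>[MR * 2];
        for (int i = 0; i < MR * 2; i++) acc[i] = Vector256<float>.Zero;

        for (int p = 0; p < kc; p++)
        {
            // One row of the B panel: 16 consecutive floats.
            Vector256<float> b0 = Avx.LoadVector256(packedB + p * NR);
            Vector256<float> b1 = Avx.LoadVector256(packedB + p * NR + 8);
            for (int r = 0; r < MR; r++)
            {
                // Broadcast one element of the A panel and fuse multiply with add.
                Vector256<float> a = Vector256.Create(packedA[p * MR + r]);
                acc[2 * r] = Fma.MultiplyAdd(a, b0, acc[2 * r]);
                acc[2 * r + 1] = Fma.MultiplyAdd(a, b1, acc[2 * r + 1]);
            }
        }

        // Add the accumulated tile into C.
        for (int r = 0; r < MR; r++)
        {
            float* row = c + r * ldc;
            Avx.Store(row, Avx.Add(Avx.LoadVector256(row), acc[2 * r]));
            Avx.Store(row + 8, Avx.Add(Avx.LoadVector256(row + 8), acc[2 * r + 1]));
        }
    }

    // Builds a tiny test case: A panel is all ones, B panel row p holds 0..15.
    public static float[] Demo()
    {
        const int kc = 4;
        float[] a = new float[kc * MR];
        Array.Fill(a, 1f);
        float[] b = new float[kc * NR];
        for (int p = 0; p < kc; p++)
            for (int j = 0; j < NR; j++)
                b[p * NR + j] = j;
        float[] c = new float[MR * NR];
        fixed (float* pa = a, pb = b, pc = c)
        {
            Kernel8x16(pa, pb, pc, NR, kc);
        }
        return c;
    }

    public static void Main()
    {
        if (!Fma.IsSupported)
        {
            Console.WriteLine("FMA is not supported on this CPU.");
            return;
        }
        float[] c = Demo();
        Console.WriteLine(c[3]); // row 0, column 3: 4 * (1 * 3) = 12
    }
}
```

The sketch needs unsafe code enabled (`AllowUnsafeBlocks`) and a CPU with AVX and FMA support.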

Stride-aware variants (see SimdMatMul.Strided.cs / SimdMatMul.Double.cs) accept (stride0, stride1) for each operand, so transposed or sliced NDArray views can be multiplied without materializing contiguous copies.
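
The stride convention can be illustrated with a scalar reference loop. This is a sketch of the indexing only, assuming element strides as described on this page; `StrideSketch` and `MatMulRef` are hypothetical names and the code is not the SIMD implementation. It multiplies the transpose of a 3x2 buffer by the buffer itself without copying: the transpose is simply the same memory read with strides (1, 2).

```csharp
using System;

public static class StrideSketch
{
    // Hypothetical scalar reference: element (i, p) of A lives at
    // a[i * aStride0 + p * aStride1], and likewise for B.
    // C is written row-major contiguous with leading dimension n.
    public static void MatMulRef(float[] a, long aStride0, long aStride1,
                                 float[] b, long bStride0, long bStride1,
                                 float[] c, long m, long n, long k)
    {
        for (long i = 0; i < m; i++)
        {
            for (long j = 0; j < n; j++)
            {
                float sum = 0f;
                for (long p = 0; p < k; p++)
                    sum += a[i * aStride0 + p * aStride1] * b[p * bStride0 + j * bStride1];
                c[i * n + j] = sum;
            }
        }
    }

    public static void Main()
    {
        // 3x2 row-major buffer: [[1, 2], [3, 4], [5, 6]].
        float[] buf = { 1, 2, 3, 4, 5, 6 };
        float[] c = new float[4];

        // A = transpose of buf: logical 2x3, strides (1, 2).
        // B = buf itself:       logical 3x2, strides (2, 1).
        MatMulRef(buf, 1, 2, buf, 2, 1, c, 2, 2, 3);

        Console.WriteLine(string.Join(", ", c)); // 35, 44, 44, 56
    }
}
```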

public static class SimdMatMul
Inheritance
SimdMatMul
Inherited Members

Methods

MatMulDouble(double*, long, long, double*, long, long, double*, long, long, long)

Stride-aware double matrix multiply: C = A * B. A is logical (M, K) with strides (aStride0, aStride1) in elements. B is logical (K, N) with strides (bStride0, bStride1) in elements. C is written as M×N row-major contiguous (leading dimension ldc = N).

public static void MatMulDouble(double* A, long aStride0, long aStride1, double* B, long bStride0, long bStride1, double* C, long M, long N, long K)

Parameters

A double*
aStride0 long
aStride1 long
B double*
bStride0 long
bStride1 long
C double*
M long
N long
K long
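
As a usage sketch, the call below multiplies a sliced view by a 2x2 identity matrix. It assumes the signature shown above, a project reference to NumSharp, and unsafe code enabled; the class name `StridedDoubleExample` and the data are made up for illustration. The view takes columns 0 and 2 of a 2x4 row-major buffer, which corresponds to strides (4, 2).

```csharp
using System;
using NumSharp.Backends.Kernels;

public static unsafe class StridedDoubleExample
{
    public static double[] Run()
    {
        // 2x4 row-major buffer; the 9s are skipped by the column stride of 2.
        double[] buf = { 1, 9, 2, 9,
                         3, 9, 4, 9 };
        // 2x2 identity, contiguous: strides (N, 1) = (2, 1).
        double[] identity = { 1, 0,
                              0, 1 };
        double[] c = new double[4];

        fixed (double* pa = buf, pb = identity, pc = c)
        {
            // A: logical 2x2 view with strides (4, 2). M = N = K = 2.
            SimdMatMul.MatMulDouble(pa, 4, 2, pb, 2, 1, pc, 2, 2, 2);
        }
        return c;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run()));
    }
}
```

If the method behaves as documented, c should hold the view's own elements (1, 2, 3, 4), since multiplying by the identity leaves the view unchanged.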

MatMulFloat(float*, long, long, float*, long, long, float*, long, long, long)

Stride-aware float matrix multiply: C = A * B. A is logical (M, K) with strides (aStride0, aStride1) in elements. B is logical (K, N) with strides (bStride0, bStride1) in elements. C is written as M×N row-major contiguous (leading dimension ldc = N).

Passing (aStride0=K, aStride1=1, bStride0=N, bStride1=1) reproduces the contiguous-input behavior of MatMulFloat(float*, float*, float*, long, long, long).

public static void MatMulFloat(float* A, long aStride0, long aStride1, float* B, long bStride0, long bStride1, float* C, long M, long N, long K)

Parameters

A float*
aStride0 long
aStride1 long
B float*
bStride0 long
bStride1 long
C float*
M long
N long
K long
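
The equivalence with the contiguous overload can be checked with a small sketch like the following. It assumes both signatures as listed on this page, a project reference to NumSharp, and unsafe code enabled; `ContiguousEquivalenceExample` and the input values are hypothetical.

```csharp
using System;
using NumSharp.Backends.Kernels;

public static unsafe class ContiguousEquivalenceExample
{
    public static (float[] Strided, float[] Contiguous) Run()
    {
        const long M = 2, N = 2, K = 3;
        float[] a = { 1, 2, 3,
                      4, 5, 6 };   // M x K, row-major
        float[] b = { 1, 0,
                      0, 1,
                      1, 1 };      // K x N, row-major
        float[] cStrided = new float[M * N];
        float[] cContiguous = new float[M * N];

        fixed (float* pa = a, pb = b, ps = cStrided, pc = cContiguous)
        {
            // Row-major contiguous inputs expressed as strides: (K, 1) and (N, 1).
            SimdMatMul.MatMulFloat(pa, K, 1, pb, N, 1, ps, M, N, K);
            // Contiguous overload on the same data.
            SimdMatMul.MatMulFloat(pa, pb, pc, M, N, K);
        }
        return (cStrided, cContiguous);
    }

    public static void Main()
    {
        var (strided, contiguous) = Run();
        Console.WriteLine(string.Join(", ", strided));
        Console.WriteLine(string.Join(", ", contiguous));
    }
}
```

By hand, the product is [[4, 5], [10, 11]], so both lines should read 4, 5, 10, 11 if the two overloads agree as described.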

MatMulFloat(float*, float*, float*, long, long, long)

Matrix multiply: C = A * B. A is [M x K], B is [K x N], C is [M x N]. All matrices must be row-major contiguous.

public static void MatMulFloat(float* A, float* B, float* C, long M, long N, long K)

Parameters

A float*
B float*
C float*
M long
N long
K long

Remarks

Supports long dimensions for arrays with more than 2 billion elements (beyond the int range). Cache blocking (MC, KC, MR, NR) keeps inner loop counters within int range, while outer loops and index calculations use long arithmetic.
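
The split between long outer loops and int inner loops can be sketched as follows. This is a simplified scalar illustration, not the library's code: `BlockedLoopSketch` is a hypothetical name, and the MC and KC values are tiny placeholders chosen so the example exercises partial blocks (the real block sizes are not stated on this page).

```csharp
using System;

public static class BlockedLoopSketch
{
    // Placeholder block sizes, assumed for illustration only.
    const int MC = 2;
    const int KC = 2;

    public static void MatMulBlocked(float[] a, float[] b, float[] c, long m, long n, long k)
    {
        Array.Clear(c, 0, c.Length);

        // Outer loops walk the full problem with long indices.
        for (long k0 = 0; k0 < k; k0 += KC)
        {
            // Block extents are bounded by KC and MC, so they fit in int.
            int kb = (int)Math.Min(KC, k - k0);
            for (long i0 = 0; i0 < m; i0 += MC)
            {
                int mb = (int)Math.Min(MC, m - i0);

                // Inner loops use int counters; offsets are computed in long.
                for (int i = 0; i < mb; i++)
                {
                    long cRow = (i0 + i) * n;
                    for (int p = 0; p < kb; p++)
                    {
                        float aip = a[(i0 + i) * k + (k0 + p)];
                        long bRow = (k0 + p) * n;
                        for (long j = 0; j < n; j++)
                            c[cRow + j] += aip * b[bRow + j];
                    }
                }
            }
        }
    }

    public static void Main()
    {
        // 3x3 times identity: blocks of size 2 leave a remainder block of size 1.
        float[] a = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
        float[] identity = { 1, 0, 0, 0, 1, 0, 0, 0, 1 };
        float[] c = new float[9];
        MatMulBlocked(a, identity, c, 3, 3, 3);
        Console.WriteLine(string.Join(", ", c)); // 1, 2, 3, 4, 5, 6, 7, 8, 9
    }
}
```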