Class SimdMatMul
High-performance SIMD matrix multiplication with cache blocking and panel packing. Single-threaded implementation achieving ~20 GFLOPS on modern CPUs.
Key optimizations:
- GEBP (General Block Panel) algorithm with cache blocking
- Full panel packing: A as [kc][MR] panels, B as [kc][NR] panels
- 8x16 micro-kernel with 16 Vector256 accumulators
- FMA (Fused Multiply-Add) for 2x FLOP throughput
- 4x k-loop unrolling for instruction-level parallelism
Stride-aware variants (see SimdMatMul.Strided.cs / SimdMatMul.Double.cs) accept a (stride0, stride1) pair for each input operand, so transposed or sliced NDArray views can be multiplied without first materializing contiguous copies.
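As a rough illustration of the panel layouts named above, the following scalar sketch packs a strided slice of A into a [kc][MR] panel and a strided slice of B into a [kc][NR] panel, then runs a micro-kernel over the two panels. PackingSketch, PackA, PackB, and MicroKernel are hypothetical names, managed arrays stand in for the pointers, and the tile is shrunk to 2x4 for readability. It is not the library's implementation, which is described as an 8x16 kernel built on Vector256 FMA.

```csharp
using System;

public static class PackingSketch
{
    // Illustrative tile sizes only (assumption); the class summary describes 8x16.
    public const int MR = 2;
    public const int NR = 4;

    // Pack rows [i0, i0+MR) and columns [k0, k0+kc) of a strided A into a [kc][MR] panel.
    // Rows past M are zero-filled so the micro-kernel needs no edge checks.
    public static float[] PackA(float[] a, long s0, long s1, long i0, long k0, int kc, long M)
    {
        var panel = new float[kc * MR];
        for (int k = 0; k < kc; k++)
            for (int r = 0; r < MR; r++)
                panel[k * MR + r] = (i0 + r < M) ? a[(i0 + r) * s0 + (k0 + k) * s1] : 0f;
        return panel;
    }

    // Pack rows [k0, k0+kc) and columns [j0, j0+NR) of a strided B into a [kc][NR] panel.
    // Columns past N are zero-filled.
    public static float[] PackB(float[] b, long s0, long s1, long k0, long j0, int kc, long N)
    {
        var panel = new float[kc * NR];
        for (int k = 0; k < kc; k++)
            for (int c = 0; c < NR; c++)
                panel[k * NR + c] = (j0 + c < N) ? b[(k0 + k) * s0 + (j0 + c) * s1] : 0f;
        return panel;
    }

    // Scalar stand-in for the SIMD micro-kernel: MR*NR accumulators,
    // one fused multiply-add per accumulator for each step along k.
    public static float[] MicroKernel(float[] aPanel, float[] bPanel, int kc)
    {
        var acc = new float[MR * NR];
        for (int k = 0; k < kc; k++)
            for (int r = 0; r < MR; r++)
                for (int c = 0; c < NR; c++)
                    acc[r * NR + c] = MathF.FusedMultiplyAdd(
                        aPanel[k * MR + r], bPanel[k * NR + c], acc[r * NR + c]);
        return acc;
    }
}
```

Once both operands are packed, the micro-kernel reads each panel sequentially regardless of the original strides, which is one plausible reason packing and stride support fit together well.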
public static class SimdMatMul
Inheritance: object → SimdMatMul
Methods
MatMulDouble(double*, long, long, double*, long, long, double*, long, long, long)
Stride-aware double matrix multiply: C = A * B. A is logical (M, K) with strides (aStride0, aStride1) in elements. B is logical (K, N) with strides (bStride0, bStride1) in elements. C is written as M×N row-major contiguous (ldc = N).
public static void MatMulDouble(double* A, long aStride0, long aStride1, double* B, long bStride0, long bStride1, double* C, long M, long N, long K)
Parameters
- A: double*
- aStride0: long
- aStride1: long
- B: double*
- bStride0: long
- bStride1: long
- C: double*
- M: long
- N: long
- K: long
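The stride semantics can be pinned down with a scalar reference: element (i, k) of A is read from A[i*aStride0 + k*aStride1], element (k, j) of B from B[k*bStride0 + j*bStride1], and C is always written with ldc = N. The sketch below mirrors the parameter order of MatMulDouble but takes managed arrays instead of pointers. StridedReference is a hypothetical helper for checking results, not part of SimdMatMul.

```csharp
using System;

public static class StridedReference
{
    // Naive triple loop; strides are measured in elements, as in the summary above.
    public static void MatMulDouble(
        double[] A, long aStride0, long aStride1,
        double[] B, long bStride0, long bStride1,
        double[] C, long M, long N, long K)
    {
        for (long i = 0; i < M; i++)
        {
            for (long j = 0; j < N; j++)
            {
                double sum = 0.0;
                for (long k = 0; k < K; k++)
                    sum += A[i * aStride0 + k * aStride1] * B[k * bStride0 + j * bStride1];
                C[i * N + j] = sum; // C is row-major contiguous (ldc = N)
            }
        }
    }
}
```

For example, a 3x2 row-major buffer viewed as its 2x3 transpose would be passed with (aStride0, aStride1) = (1, 2), so no transposed copy is needed.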
MatMulFloat(float*, long, long, float*, long, long, float*, long, long, long)
Stride-aware matrix multiply: C = A * B. A is logical (M, K) with strides (aStride0, aStride1) in elements. B is logical (K, N) with strides (bStride0, bStride1) in elements. C is written as M×N row-major contiguous (ldc = N).
Passing (aStride0=K, aStride1=1, bStride0=N, bStride1=1) reproduces the contiguous-input behavior of MatMulFloat(float*, float*, float*, long, long, long).
public static void MatMulFloat(float* A, long aStride0, long aStride1, float* B, long bStride0, long bStride1, float* C, long M, long N, long K)
Parameters
- A: float*
- aStride0: long
- aStride1: long
- B: float*
- bStride0: long
- bStride1: long
- C: float*
- M: long
- N: long
- K: long
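To make the stride pairs concrete, this sketch derives (stride0, stride1) for a few common views over a row-major buffer and checks element addressing. ViewStrides and its members are hypothetical helpers, not NDArray or SimdMatMul APIs. Only the addressing rule offset = i*stride0 + j*stride1 is taken from the description above.

```csharp
using System;

public static class ViewStrides
{
    // A row-major buffer with `cols` columns, viewed as-is.
    public static (long s0, long s1) Contiguous(long cols) => (cols, 1L);

    // The transpose of a row-major buffer with `cols` columns.
    public static (long s0, long s1) Transposed(long cols) => (1L, cols);

    // Every `step`-th column of a row-major buffer with `cols` columns.
    public static (long s0, long s1) ColumnStep(long cols, long step) => (cols, step);

    // Read logical element (i, j) of a view described by strides `s`.
    public static float At(float[] buf, (long s0, long s1) s, long i, long j)
        => buf[i * s.s0 + j * s.s1];
}
```

With a contiguous (M, K) operand, Contiguous(K) yields (K, 1), which is the pair the description above says reproduces the contiguous-input overload.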
MatMulFloat(float*, float*, float*, long, long, long)
Matrix multiply: C = A * B. A is [M x K], B is [K x N], C is [M x N]. All matrices must be row-major contiguous.
public static void MatMulFloat(float* A, float* B, float* C, long M, long N, long K)
Parameters
- A: float*
- B: float*
- C: float*
- M: long
- N: long
- K: long
Remarks
Supports long dimensions, so arrays with more than int.MaxValue (about 2.1 billion) elements can be multiplied. Cache blocking (MC, KC, MR, NR) keeps inner loop counters within int range, while outer loops and index calculations use long arithmetic.
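The need for long arithmetic appears as soon as a flat offset exceeds int.MaxValue: the same row-major expression overflows in int but not in long. The sketch below demonstrates that and shows one hypothetical way to cap a block length so an inner loop counter can stay an int. IndexSketch, its members, and the block size used in the example are illustrative assumptions, not the values or code used by SimdMatMul.

```csharp
using System;

public static class IndexSketch
{
    // Flat row-major offset computed in long arithmetic.
    public static long FlatOffset(long i, long j, long n) => i * n + j;

    // Reports whether the same offset would overflow if computed in int.
    public static bool OverflowsInt(int i, int j, int n)
    {
        try
        {
            _ = checked(i * n + j);
            return false;
        }
        catch (OverflowException)
        {
            return true;
        }
    }

    // Length of the block that starts at `start` in a dimension of size `total`,
    // capped at blockSize so the result always fits in an int.
    public static int BlockLength(long start, long total, int blockSize)
        => (int)Math.Min(blockSize, total - start);
}
```

An outer loop would advance a long start position in steps of the block size, while the loop inside each block counts from 0 to BlockLength(...) with an int.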