Class SimdMatMul
High-performance SIMD matrix multiplication with cache blocking and panel packing. The single-threaded implementation achieves roughly 20 GFLOPS (single-precision) on modern CPUs.
Key optimizations:
- GEBP (General Block Panel) algorithm with cache blocking
- Full panel packing: A as [kc][MR] panels, B as [kc][NR] panels
- 8x16 micro-kernel with 16 Vector256 accumulators
- FMA (Fused Multiply-Add) for 2x FLOP throughput
- 4x k-loop unrolling for instruction-level parallelism
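The blocking structure behind GEBP can be sketched in simplified scalar form. This is an illustration only, not the class's actual implementation: the block sizes `Kc`/`Mc` below are placeholders (real implementations tune them to L1/L2 cache sizes), and the SIMD micro-kernel, panel packing, and unrolling are omitted for clarity.

```csharp
using System;

static class GebpSketch
{
    // Illustrative block sizes; in practice these are tuned to cache capacity.
    const int Kc = 4, Mc = 4;

    // C += A * B with cache blocking over K (depth) and M (rows).
    // A is [M x K], B is [K x N], C is [M x N], all row-major contiguous.
    public static void MatMulBlocked(float[] A, float[] B, float[] C, int M, int N, int K)
    {
        for (int k0 = 0; k0 < K; k0 += Kc)           // slice B into kc-row panels
        {
            int kc = Math.Min(Kc, K - k0);
            for (int i0 = 0; i0 < M; i0 += Mc)       // slice A into mc-row blocks
            {
                int mc = Math.Min(Mc, M - i0);
                // Stand-in for the micro-kernel: multiply the mc x kc block
                // of A by the kc x N panel of B, accumulating into C.
                for (int i = i0; i < i0 + mc; i++)
                    for (int k = k0; k < k0 + kc; k++)
                    {
                        float a = A[i * K + k];
                        for (int j = 0; j < N; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
            }
        }
    }
}
```

The point of the blocking is that each `kc x N` panel of B is reused across every row block of A while it is still resident in cache; the real implementation additionally packs these panels into contiguous `[kc][MR]` / `[kc][NR]` buffers so the micro-kernel reads them with unit stride.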
public static class SimdMatMul
- Inheritance
- object → SimdMatMul
Methods
MatMulFloat(float*, float*, float*, int, int, int)
Matrix multiply: C = A * B, where A is [M x K], B is [K x N], and C is [M x N]. All matrices must be row-major and contiguous.
public static void MatMulFloat(float* A, float* B, float* C, int M, int N, int K)
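A minimal usage sketch, assuming the `SimdMatMul` class documented above is available in the project. Because the method takes raw `float*` pointers, the arrays must be pinned with `fixed` inside an `unsafe` context (the project must be compiled with `AllowUnsafeBlocks`); the dimensions here are arbitrary examples.

```csharp
using System;

class Example
{
    static unsafe void Main()
    {
        const int M = 8, K = 4, N = 16;              // example sizes
        float[] a = new float[M * K];                // A: [M x K], row-major
        float[] b = new float[K * N];                // B: [K x N], row-major
        float[] c = new float[M * N];                // C: [M x N], row-major

        var rng = new Random(42);
        for (int i = 0; i < a.Length; i++) a[i] = (float)rng.NextDouble();
        for (int i = 0; i < b.Length; i++) b[i] = (float)rng.NextDouble();

        // Pin the arrays so the GC cannot move them while the kernel runs.
        fixed (float* pa = a)
        fixed (float* pb = b)
        fixed (float* pc = c)
        {
            SimdMatMul.MatMulFloat(pa, pb, pc, M, N, K);
        }

        Console.WriteLine($"C[0,0] = {c[0]}");
    }
}
```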