
Class SimdMatMul

Namespace: NumSharp.Backends.Kernels
Assembly: NumSharp.dll

High-performance SIMD matrix multiplication with cache blocking and panel packing. Single-threaded implementation achieving ~20 GFLOPS on modern CPUs.

Key optimizations:

  • GEBP (general block-times-panel) algorithm with cache blocking
  • Full panel packing: A as [kc][MR] panels, B as [kc][NR] panels
  • 8x16 micro-kernel with 16 Vector256 accumulators
  • FMA (Fused Multiply-Add) for 2x FLOP throughput
  • 4x k-loop unrolling for instruction-level parallelism
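The blocking and packing scheme above can be sketched in portable C. This is an illustrative sketch, not NumSharp's implementation: the block sizes `MC`/`KC`/`NC`, the function names, and the scalar `MR = 4` x `NR = 4` micro-kernel are assumptions standing in for the real 8x16 AVX2/FMA kernel, but the panel layouts ([kc][MR] for A, [kc][NR] for B) and loop nesting follow the bullet list.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative block sizes (assumptions, not NumSharp's actual constants). */
#define MC 64   /* rows of the A block kept hot in cache          */
#define KC 64   /* shared dimension of the packed panels          */
#define NC 128  /* columns of the B block processed per pass      */
#define MR 4    /* micro-kernel rows    (the real kernel uses 8)  */
#define NR 4    /* micro-kernel columns (the real kernel uses 16) */

/* Pack an mc x kc block of A into panels of MR rows, laid out [kc][MR]
   per panel, so the micro-kernel reads A contiguously; pad with zeros. */
static void pack_a(const float *A, int lda, int mc, int kc, float *buf) {
    for (int i = 0; i < mc; i += MR) {
        int mr = (mc - i < MR) ? mc - i : MR;
        for (int k = 0; k < kc; k++)
            for (int r = 0; r < MR; r++)
                *buf++ = (r < mr) ? A[(i + r) * lda + k] : 0.0f;
    }
}

/* Pack a kc x nc block of B into panels of NR columns, laid out [kc][NR]. */
static void pack_b(const float *B, int ldb, int kc, int nc, float *buf) {
    for (int j = 0; j < nc; j += NR) {
        int nr = (nc - j < NR) ? nc - j : NR;
        for (int k = 0; k < kc; k++)
            for (int c = 0; c < NR; c++)
                *buf++ = (c < nr) ? B[k * ldb + j + c] : 0.0f;
    }
}

/* Scalar MR x NR micro-kernel. The real code keeps the tile of C in 16
   Vector256 accumulators, uses FMA, and unrolls the k loop by 4. */
static void micro_kernel(int kc, const float *a, const float *b,
                         float *C, int ldc, int mr, int nr) {
    float acc[MR][NR] = {{0}};
    for (int k = 0; k < kc; k++)
        for (int i = 0; i < MR; i++)
            for (int j = 0; j < NR; j++)
                acc[i][j] += a[k * MR + i] * b[k * NR + j];
    for (int i = 0; i < mr; i++)          /* write back only the valid part */
        for (int j = 0; j < nr; j++)
            C[i * ldc + j] += acc[i][j];
}

/* C = A * B, all row-major; caller must zero C first. */
void gebp_matmul(const float *A, const float *B, float *C,
                 int M, int N, int K) {
    float *Abuf = malloc(sizeof(float) * MC * KC);
    float *Bbuf = malloc(sizeof(float) * KC * NC);
    for (int jc = 0; jc < N; jc += NC) {
        int nc = (N - jc < NC) ? N - jc : NC;
        for (int k0 = 0; k0 < K; k0 += KC) {
            int kc = (K - k0 < KC) ? K - k0 : KC;
            pack_b(B + k0 * N + jc, N, kc, nc, Bbuf);
            for (int ic = 0; ic < M; ic += MC) {
                int mc = (M - ic < MC) ? M - ic : MC;
                pack_a(A + ic * K + k0, K, mc, kc, Abuf);
                for (int jr = 0; jr < nc; jr += NR)
                    for (int ir = 0; ir < mc; ir += MR) {
                        int mr = (mc - ir < MR) ? mc - ir : MR;
                        int nr = (nc - jr < NR) ? nc - jr : NR;
                        micro_kernel(kc,
                                     Abuf + (ir / MR) * kc * MR,
                                     Bbuf + (jr / NR) * kc * NR,
                                     C + (ic + ir) * N + jc + jr,
                                     N, mr, nr);
                    }
            }
        }
    }
    free(Abuf); free(Bbuf);
}
```

The packing step is what makes the micro-kernel fast: both panel buffers are traversed with stride-1 reads in the k loop, so the inner kernel never touches the original strided matrices.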
public static class SimdMatMul

Inheritance
object ← SimdMatMul

Methods

MatMulFloat(float*, float*, float*, int, int, int)

Matrix multiply: C = A * B. A is [M x K], B is [K x N], C is [M x N]. All matrices must be row-major and contiguous.

public static void MatMulFloat(float* A, float* B, float* C, int M, int N, int K)

Parameters

A float*
Pointer to the first element of A, an [M x K] row-major matrix.
B float*
Pointer to the first element of B, a [K x N] row-major matrix.
C float*
Pointer to the first element of C, the [M x N] row-major output matrix.
M int
Number of rows of A and C.
N int
Number of columns of B and C.
K int
Number of columns of A and rows of B (the shared inner dimension).
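The contract above (row-major, element (i, j) of an R x Cols matrix at offset i * Cols + j) can be pinned down with a naive reference version. This is a semantic sketch in C under a hypothetical name, not the SIMD implementation; since the summary states C = A * B, it overwrites C rather than accumulating into it.

```c
/* Naive reference with the same contract as MatMulFloat:
   A is [M x K], B is [K x N], C is [M x N], all row-major contiguous.
   Element (i, j) of a row-major matrix with Cols columns is at i * Cols + j. */
void matmul_reference(const float *A, const float *B, float *C,
                      int M, int N, int K) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < K; k++)
                sum += A[i * K + k] * B[k * N + j];
            C[i * N + j] = sum;   /* C = A * B: overwrite, no accumulation */
        }
}
```

A reference like this is also the natural oracle for validating an optimized kernel on small, exactly representable inputs.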