Class SimdMatMul

Namespace
NumSharp.Backends.Kernels
Assembly
NumSharp.dll

High-performance SIMD matrix multiplication with cache blocking and panel packing. Single-threaded implementation achieving ~20 GFLOPS on modern CPUs.

Key optimizations:

  • GEBP (General Block Panel) algorithm with cache blocking
  • Full panel packing: A as [kc][MR] panels, B as [kc][NR] panels
  • 8x16 micro-kernel with 16 Vector256 accumulators
  • FMA (Fused Multiply-Add) for 2x FLOP throughput
  • 4x k-loop unrolling for instruction-level parallelism
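The blocking and packing strategy above can be sketched in plain C. This is a toy-sized, scalar illustration, not the NumSharp implementation: the block sizes `MC`, `KC`, `NR` are deliberately tiny so the loop structure is easy to trace, `gebp_matmul` and `micro_kernel` are hypothetical names, and the scalar inner loop stands in for the real 8x16 AVX2/FMA micro-kernel.

```c
#include <stdint.h>
#include <string.h>

/* Toy block sizes; the real kernel uses cache-sized MC/KC and an
 * 8x16 (MR x NR) register-blocked micro-kernel. */
#define MC 4   /* rows of A kept resident per outer block      */
#define KC 4   /* depth of the packed A and B panels           */
#define NR 4   /* columns handled per micro-kernel invocation  */

/* Scalar stand-in for the micro-kernel:
 * C[mc x nr] += Apack[mc x kc] * Bpack[kc x nr].
 * The real kernel holds the C tile in 16 Vector256 accumulators
 * and updates them with FMA instructions. */
static void micro_kernel(const float *Apack, const float *Bpack,
                         float *C, int64_t ldc, int mc, int kc, int nr)
{
    for (int i = 0; i < mc; i++)
        for (int p = 0; p < kc; p++)
            for (int j = 0; j < nr; j++)
                C[i * ldc + j] += Apack[i * kc + p] * Bpack[p * nr + j];
}

/* GEBP-style blocked multiply: C[MxN] += A[MxK] * B[KxN], row-major.
 * C must be zero-initialized by the caller. */
static void gebp_matmul(const float *A, const float *B, float *C,
                        int64_t M, int64_t N, int64_t K)
{
    float Apack[MC * KC], Bpack[KC * NR];
    for (int64_t pc = 0; pc < K; pc += KC) {          /* depth blocking */
        int kc = (int)(K - pc < KC ? K - pc : KC);
        for (int64_t ic = 0; ic < M; ic += MC) {      /* row blocking */
            int mc = (int)(M - ic < MC ? M - ic : MC);
            /* pack the A block contiguously as [mc][kc] */
            for (int i = 0; i < mc; i++)
                memcpy(&Apack[(size_t)i * kc], &A[(ic + i) * K + pc],
                       (size_t)kc * sizeof(float));
            for (int64_t jc = 0; jc < N; jc += NR) {  /* column panels */
                int nr = (int)(N - jc < NR ? N - jc : NR);
                /* pack the B panel contiguously as [kc][nr]; a real GEBP
                 * packs each B panel once per depth block, not per ic */
                for (int p = 0; p < kc; p++)
                    memcpy(&Bpack[(size_t)p * nr], &B[(pc + p) * N + jc],
                           (size_t)nr * sizeof(float));
                micro_kernel(Apack, Bpack, &C[ic * N + jc], N, mc, kc, nr);
            }
        }
    }
}
```

Packing makes the inner loops stream over contiguous memory regardless of the original matrix strides, which is what lets the micro-kernel saturate the FMA units.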
public static class SimdMatMul
Inheritance
object ← SimdMatMul

Methods

MatMulFloat(float*, float*, float*, long, long, long)

Matrix multiply: C = A * B. A is [M x K], B is [K x N], C is [M x N]. All matrices must be row-major and contiguous.

public static void MatMulFloat(float* A, float* B, float* C, long M, long N, long K)

Parameters

A float*
Pointer to the first element of A, an [M x K] row-major matrix.

B float*
Pointer to the first element of B, a [K x N] row-major matrix.

C float*
Pointer to the first element of C, the [M x N] row-major result.

M long
Number of rows of A and C.

N long
Number of columns of B and C.

K long
Number of columns of A and rows of B (the shared inner dimension).

Remarks

Supports long dimensions so matrices can exceed 2^31 (about 2 billion) elements. Cache blocking (MC, KC, MR, NR) keeps the inner loops within int range, while the outer loops and element offset calculations use long arithmetic.
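The contract described in the remarks can be pinned down with a minimal scalar reference in C. This is a hypothetical sketch of the method's semantics only (the name `matmul_ref` is invented here); the actual kernel is blocked and vectorized, but must produce the same result.

```c
#include <stdint.h>

/* Scalar reference for MatMulFloat's contract: C = A * B with
 * A [M x K], B [K x N], C [M x N], all row-major contiguous.
 * Every offset (i * K + p, p * N + j, i * N + j) is computed in
 * int64_t, mirroring the long arithmetic needed once M*K or K*N
 * exceeds the int range. */
static void matmul_ref(const float *A, const float *B, float *C,
                       int64_t M, int64_t N, int64_t K)
{
    for (int64_t i = 0; i < M; i++) {
        for (int64_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int64_t p = 0; p < K; p++)
                acc += A[i * K + p] * B[p * N + j];  /* 64-bit offsets */
            C[i * N + j] = acc;
        }
    }
}
```

For example, with A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], the result is C = [[19, 22], [43, 50]].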