Draft: SME2: GEMM & GEMV micro-kernels, Advanced SIMD packing kernels
- Added matrix multiplication (1×N, GEMV) micro-kernels for a dynamically quantized asymmetric signed 8-bit per-row (QAI8DXP) LHS and a symmetric signed 4-bit per-block (QSI4C32P) RHS, both operands packed, accumulating into a single-precision floating-point (F32) output, optimized for SME2.
- Added matrix multiplication (M×N, GEMM) micro-kernels for the same quantization scheme (QAI8DXP × QSI4C32P) and F32 output, optimized for SME2.
- Added new RHS packing kernels (K×N and N×K) with Advanced SIMD variants written with intrinsics.
- Updated existing pack interfaces for consistency (kai_rhs_pack_kxn_qsi4c32p_qsu4c32s1s0).
- Integrated the new kernels into the matmul registry and benchmark harness.
- Refactored and extended the tests to cover the SME2 GEMM and GEMV variants.
- Updated the build files (CMakeLists.txt, Bazel) and CHANGELOG.md.
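
For reviewers less familiar with the quantization scheme, the arithmetic behind the 1×N case can be sketched as a scalar reference. This is only an illustration under stated assumptions, not code from this change: the function name `ref_gemv_qai8dx_qsi4c32_f32`, the unpacked row-major layouts, the one-int4-value-per-`int8_t` storage, the block length of 32 and the zero-point sign convention are all hypothetical, and the real micro-kernels operate on packed buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_LEN 32 /* assumed quantization block length along K */

/* Hypothetical scalar reference, not part of the library.
 * LHS: one row of k int8 values with a single scale and zero point,
 *      real_a = lhs_scale * (q_a - lhs_zero_point).
 * RHS: n rows of k signed 4-bit values (stored one per int8_t, range [-8, 7]),
 *      with one scale per block of BLOCK_LEN values, real_b = scale * q_b.
 * dst: n F32 outputs. */
static void ref_gemv_qai8dx_qsi4c32_f32(
    size_t n, size_t k,
    const int8_t* lhs_q, float lhs_scale, int32_t lhs_zero_point,
    const int8_t* rhs_q,     /* n x k, row-major */
    const float* rhs_scales, /* n x (k / BLOCK_LEN), row-major */
    float* dst) {
    assert(k % BLOCK_LEN == 0); /* sketch handles whole blocks only */
    const size_t num_blocks = k / BLOCK_LEN;

    for (size_t col = 0; col < n; ++col) {
        float acc = 0.0f;
        for (size_t b = 0; b < num_blocks; ++b) {
            /* Integer dot product within one block. */
            int32_t iacc = 0;
            for (size_t i = 0; i < BLOCK_LEN; ++i) {
                const size_t kk = b * BLOCK_LEN + i;
                iacc += ((int32_t)lhs_q[kk] - lhs_zero_point) *
                        (int32_t)rhs_q[col * k + kk];
            }
            /* Apply the LHS row scale and the RHS block scale, accumulate in F32. */
            acc += (float)iacc * lhs_scale * rhs_scales[col * num_blocks + b];
        }
        dst[col] = acc;
    }
}
```

The M×N case repeats the same computation per LHS row, each row carrying its own scale and zero point.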

Signed-off-by: Anitha Raj <anitha.raj@arm.com>
Signed-off-by: Declan Cox <declan.cox@arm.com>