SME2: GEMM & GEMV micro-kernels, Advanced SIMD packing kernels (!532) · Merge requests · Kleidi / KleidiAI

Declan Cox requested to merge qai8cxp_qsi4c32p into main Nov 24, 2025

Added GEMV (1×N) matrix multiplication micro-kernels for dynamically quantized asymmetric signed 8-bit per-row (QAI8DXP) LHS × symmetric signed 4-bit per-block (QSI4C32P) RHS, both operands packed, accumulating into single-precision floating-point (F32) output and optimized for SME2.
Added GEMM (M×N) matrix multiplication micro-kernels for the same quantization scheme (QAI8DXP × QSI4C32P) and accumulation path (F32), optimized for SME2.
Added new RHS packing kernels (K×N, N×K) with Advanced SIMD variants using intrinsics.
Updated existing pack interfaces for consistency (kai_rhs_pack_kxn_qsi4c32p_qsu4c32s1s0).
Integrated kernels into matmul registry and benchmark harness.
Refactored and extended tests to cover SME2 GEMM/GEMV variants.
Updated build files (CMakeLists.txt, Bazel) and CHANGELOG.md.
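
For readers unfamiliar with the QAI8DXP × QSI4C32P scheme named above, the arithmetic can be sketched as a scalar reference in C. This is an illustrative sketch only, not the KleidiAI API or the SME2 implementation: the function names, the dequantization convention `real = scale * (q - zero_point)`, the block length of 32, and the unpacked layout (one 4-bit value per `int8_t` instead of two nibbles per byte) are all assumptions made for readability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed block length: 32 consecutive K values of one RHS row share a scale. */
enum { BLOCK_LEN = 32 };

/* Round half away from zero; avoids depending on libm. */
static int32_t round_nearest(float v) {
    return (int32_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Hypothetical helper: dynamic asymmetric int8 quantization of one LHS row.
 * The range is widened to include 0 so that zero is exactly representable. */
static void ref_quantize_row_qai8dx(const float* x, size_t k, int8_t* q,
                                    float* scale, int32_t* zero_point) {
    float lo = 0.0f, hi = 0.0f;
    for (size_t i = 0; i < k; ++i) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    *scale = (hi > lo) ? (hi - lo) / 255.0f : 1.0f;
    *zero_point = round_nearest(-128.0f - lo / *scale);
    for (size_t i = 0; i < k; ++i) {
        int32_t v = round_nearest(x[i] / *scale) + *zero_point;
        if (v < -128) v = -128;
        if (v > 127) v = 127;
        q[i] = (int8_t)v;
    }
}

/* Hypothetical reference GEMV (1xN): one quantized LHS row against N RHS rows.
 * rhs_q holds n*k values in [-8, 7]; rhs_scales holds n*(k/BLOCK_LEN) scales. */
static void ref_gemv_qai8dx_qsi4c32(size_t n, size_t k, const int8_t* lhs_q,
                                    float lhs_scale, int32_t lhs_zero_point,
                                    const int8_t* rhs_q, const float* rhs_scales,
                                    float* out) {
    assert(k % BLOCK_LEN == 0);
    const size_t num_blocks = k / BLOCK_LEN;
    for (size_t col = 0; col < n; ++col) {
        float acc_f32 = 0.0f;
        for (size_t b = 0; b < num_blocks; ++b) {
            int32_t acc_i32 = 0; /* integer dot product within one block */
            for (size_t i = 0; i < BLOCK_LEN; ++i) {
                const size_t kk = b * BLOCK_LEN + i;
                acc_i32 += ((int32_t)lhs_q[kk] - lhs_zero_point) *
                           (int32_t)rhs_q[col * k + kk];
            }
            /* Apply the per-row LHS scale and per-block RHS scale, accumulate in F32. */
            acc_f32 += (float)acc_i32 * lhs_scale * rhs_scales[col * num_blocks + b];
        }
        out[col] = acc_f32;
    }
}
```

Under the same assumptions, the M×N (GEMM) case repeats this computation for each LHS row, with each row carrying its own scale and zero point.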

Signed-off-by: Anitha Raj <anitha.raj@arm.com>
Signed-off-by: Declan Cox <declan.cox@arm.com>
