GEMM and LHS packing kernels: f16 LHS x QSI4c32p RHS
Add micro-kernels that multiply a packed f16 LHS matrix by an RHS matrix of symmetric 4-bit integers with per-channel quantisation, accumulating into a single-precision (f32) output matrix. This MR also includes the kernel that scales and packs the LHS matrix (f32 -> f16).
The GEMM kernel has been optimised for SME2.
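
For reviewers, the arithmetic described above can be sketched as a scalar reference. This is only an illustration under stated assumptions, not the packed SME2 implementation: the function name ref_matmul_f32_f16_qsi4 is hypothetical, float stands in for f16 so the code builds with any C compiler, the nibble order (low nibble = even k index) is assumed, and a single f32 scale per RHS channel is assumed. The real packed format may store scales and nibbles differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Sign-extend a 4-bit two's-complement value (0..15) to the range -8..7. */
static int sext4(uint8_t nibble) {
    return (nibble & 0x8) ? (int)nibble - 16 : (int)nibble;
}

/* Hypothetical scalar reference; the layout here is an assumption.
 * lhs:   m x k, row-major; float stands in for f16 values.
 * rhs:   n rows of k/2 bytes, one row per output channel, two int4 per byte
 *        (assumed: low nibble holds the even k index).
 * scale: n per-channel scales (assumed f32).
 * dst:   m x n, row-major, f32.
 * k must be even. */
static void ref_matmul_f32_f16_qsi4(size_t m, size_t n, size_t k,
                                    const float *lhs, const uint8_t *rhs,
                                    const float *scale, float *dst) {
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f; /* accumulate in single precision */
            for (size_t p = 0; p < k; p += 2) {
                uint8_t byte = rhs[j * (k / 2) + p / 2];
                acc += lhs[i * k + p] * (float)sext4(byte & 0x0F);
                acc += lhs[i * k + p + 1] * (float)sext4(byte >> 4);
            }
            /* Apply the per-channel scale once per output element. */
            dst[i * n + j] = acc * scale[j];
        }
    }
}
```

A reference of this shape is mainly useful for checking the optimised kernel's output within a tolerance on small shapes.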
Signed-off-by: Hugo OKeeffe <hugo.okeeffe@arm.com>