Refactor RHS packing function for F32 <- QAI8DXP x QSU4C32
- Rename the packing function to include the bf16 scale factor
- Optimize the scalar variant. The new implementation is ~1.5x faster than the previous one
Signed-off-by: Gian Marco Iodice <gianmarco.iodice@arm.com>