Draft: Add SME2 3x3 depthwise int8 kernel (stride=1, per-ch sym weight quant, float scale requant)
- Implements a planar depthwise convolution kernel: 3x3, stride=1, int8.
- Inputs and outputs use per-tensor asymmetric quantization.
- Weights use per-channel symmetric quantization (asymmetric support can be added if required).
- Accumulates in int32; the epilogue applies per-channel float-scale requantization with saturation to int8.
- Includes a required weight packing kernel, packing in VL-sized blocks for SME2 execution.
- Adds initial sources and headers for the kernel and the weight packing kernel, plus unit tests.
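
For reviewers, the arithmetic described above can be summarised with a scalar reference sketch. This is not the SME2 kernel and not the code in this change: the function names (`requantize`, `dw3x3_s1_ref`), the row-major single-plane layout, the absence of padding, the round-half-away-from-zero rounding, and the single combined per-channel scale are all assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: scale the int32 accumulator by a float, add the
 * output zero point, saturate to the int8 range, then round.
 * Rounding mode (half away from zero) is an assumption. */
static int8_t requantize(int32_t acc, float scale, int32_t out_zp) {
    float f = (float)acc * scale + (float)out_zp;
    if (f > 127.0f) f = 127.0f;      /* saturate high */
    if (f < -128.0f) f = -128.0f;    /* saturate low */
    return (int8_t)(f >= 0.0f ? f + 0.5f : f - 0.5f);
}

/* Hypothetical scalar reference for ONE channel plane.
 * in:     h x w int8 plane, row-major (planar layout)
 * k:      3x3 int8 weights for this channel (symmetric, so no weight zero point)
 * in_zp:  per-tensor input zero point (asymmetric input)
 * scale:  per-channel float requantization scale for this channel
 * out_zp: per-tensor output zero point
 * out:    (h-2) x (w-2) int8 plane; no padding is assumed here */
static void dw3x3_s1_ref(const int8_t *in, int h, int w,
                         const int8_t k[9], int32_t in_zp,
                         float scale, int32_t out_zp, int8_t *out) {
    const int out_w = w - 2;
    for (int y = 0; y + 2 < h; ++y) {
        for (int x = 0; x + 2 < w; ++x) {
            int32_t acc = 0;  /* int32 accumulation */
            for (int ky = 0; ky < 3; ++ky) {
                for (int kx = 0; kx < 3; ++kx) {
                    int32_t v = (int32_t)in[(y + ky) * w + (x + kx)] - in_zp;
                    acc += v * (int32_t)k[ky * 3 + kx];
                }
            }
            out[y * out_w + x] = requantize(acc, scale, out_zp);
        }
    }
}
```

A reference like this is mainly useful as an oracle in unit tests: run it per channel with that channel's weights and scale, then compare against the vectorised output.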
Signed-off-by: Declan Cox <declan.cox@arm.com>