Optimizes RHS packing qsu4c32s16s0->qsi4c32pscalef16
Optimizes this RHS packing by vectorizing the XOR operation. The vectorized path is used for segment lengths of 4 or 8 bytes; any other segment length falls back to the unoptimized path.
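
For illustration only, a minimal sketch of the dispatch described above. It is not the code from this change. It assumes the XOR flips the top bit of every 4-bit value (mask 0x88 per byte), which is inferred from the qsu4 to qsi4 naming, and it stands in for real SIMD intrinsics with portable word-wide XOR. The name xor_segment and the mask macro are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed mask: flips the top bit of both 4-bit values held in one byte. */
#define NIBBLE_SIGN_MASK 0x88u

/* XOR one segment of packed 4-bit values.
 * Segments of 8 or 4 bytes are handled with a single word-wide XOR;
 * every other length takes the byte-by-byte fallback loop. */
static void xor_segment(uint8_t* dst, const uint8_t* src, size_t segment_len) {
    assert(dst != NULL && src != NULL);

    if (segment_len == 8) {
        uint64_t v;
        memcpy(&v, src, sizeof(v));           /* alignment-safe load */
        v ^= UINT64_C(0x8888888888888888);    /* 8 bytes in one XOR */
        memcpy(dst, &v, sizeof(v));
    } else if (segment_len == 4) {
        uint32_t v;
        memcpy(&v, src, sizeof(v));
        v ^= UINT32_C(0x88888888);            /* 4 bytes in one XOR */
        memcpy(dst, &v, sizeof(v));
    } else {
        /* Unoptimized path for all other segment lengths. */
        for (size_t i = 0; i < segment_len; ++i) {
            dst[i] = (uint8_t)(src[i] ^ NIBBLE_SIGN_MASK);
        }
    }
}
```

Because the mask is the same in every byte, all three paths produce identical output for the same input bytes; only the number of XOR operations differs.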
Signed-off-by: Dan Johansson <dan.johansson@arm.com>