Remove OpenCL padding: CLReductionOperationKernel

Change the parallel implementation of the reduction along the X axis: every thread now computes one row
Add missing test for MEAN_SUM
Make reduction on any axis != 0 work with num_channels > 1
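
The row-per-thread scheme described above can be sketched in plain C. This is only an illustration under stated assumptions, not the actual OpenCL kernel from this change: the names `reduce_row_sum` and `reduce_x` are hypothetical, the buffer is assumed to be row-major and unpadded, and the host-side loop stands in for what would be one work-item per row on the device.

```c
#include <stddef.h>

/* Hypothetical stand-in for a single work-item: reduces row `row` of a
 * row-major, unpadded `height x width` buffer to one sum. */
static float reduce_row_sum(const float *src, size_t width, size_t row)
{
    const float *p = src + row * width; /* start of this thread's row */
    float acc = 0.0f;
    for (size_t x = 0; x < width; ++x) {
        acc += p[x];
    }
    return acc;
}

/* Emulates the launch: one "thread" per row. On a device, each loop
 * iteration would run as an independent work-item.
 * Assumes width > 0 when `mean` is non-zero. */
static void reduce_x(const float *src, float *dst,
                     size_t width, size_t height, int mean)
{
    for (size_t row = 0; row < height; ++row) {
        float sum = reduce_row_sum(src, width, row);
        dst[row] = mean ? sum / (float)width : sum; /* mean variant divides by the row length */
    }
}
```

Because each row is reduced independently and only in-bounds elements are read, a sketch like this needs no padding at the end of a row, which is presumably the point of assigning one row per thread.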

Resolve COMPMID-3917

Signed-off-by: Giorgio Arena <giorgio.arena@arm.com>
Change-Id: Ib0f99540104e3c253bcd1ea637833db533f5e76e
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/5522
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Manuel Bottini <manuel.bottini@arm.com>
Reviewed-by: Gian Marco Iodice <gianmarco.iodice@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
diff --git a/src/core/CL/cl_kernels/helpers_asymm.h b/src/core/CL/cl_kernels/helpers_asymm.h
index 27878cd..562c5d3 100644
--- a/src/core/CL/cl_kernels/helpers_asymm.h
+++ b/src/core/CL/cl_kernels/helpers_asymm.h
@@ -425,9 +425,22 @@
 QUANTIZE_IMPL(char, 1)
 QUANTIZE_IMPL(uint, 1)
 QUANTIZE_IMPL(int, 1)
+QUANTIZE_IMPL(uchar, 2)
+QUANTIZE_IMPL(char, 2)
+QUANTIZE_IMPL(uint, 2)
+QUANTIZE_IMPL(int, 2)
+QUANTIZE_IMPL(uchar, 3)
+QUANTIZE_IMPL(char, 3)
+QUANTIZE_IMPL(uint, 3)
+QUANTIZE_IMPL(int, 3)
 QUANTIZE_IMPL(uchar, 4)
 QUANTIZE_IMPL(ushort, 4)
 QUANTIZE_IMPL(short, 4)
+QUANTIZE_IMPL(int, 4)
+QUANTIZE_IMPL(uchar, 8)
+QUANTIZE_IMPL(char, 8)
+QUANTIZE_IMPL(uint, 8)
+QUANTIZE_IMPL(int, 8)
 QUANTIZE_IMPL(uchar, 16)
 QUANTIZE_IMPL(char, 16)
 QUANTIZE_IMPL(ushort, 16)
@@ -439,9 +452,22 @@
 DEQUANTIZE_IMPL(char, 1)
 DEQUANTIZE_IMPL(uint, 1)
 DEQUANTIZE_IMPL(int, 1)
+DEQUANTIZE_IMPL(uchar, 2)
+DEQUANTIZE_IMPL(char, 2)
+DEQUANTIZE_IMPL(uint, 2)
+DEQUANTIZE_IMPL(int, 2)
+DEQUANTIZE_IMPL(uchar, 3)
+DEQUANTIZE_IMPL(char, 3)
+DEQUANTIZE_IMPL(uint, 3)
+DEQUANTIZE_IMPL(int, 3)
 DEQUANTIZE_IMPL(uchar, 4)
 DEQUANTIZE_IMPL(ushort, 4)
 DEQUANTIZE_IMPL(short, 4)
+DEQUANTIZE_IMPL(int, 4)
+DEQUANTIZE_IMPL(uchar, 8)
+DEQUANTIZE_IMPL(char, 8)
+DEQUANTIZE_IMPL(uint, 8)
+DEQUANTIZE_IMPL(int, 8)
 DEQUANTIZE_IMPL(uchar, 16)
 DEQUANTIZE_IMPL(char, 16)
 DEQUANTIZE_IMPL(ushort, 16)