Implement FP32/FP16 MatMul NT/NT kernel using the MMUL extension

Resolves COMPMID-6194

Signed-off-by: SiCong Li <sicong.li@arm.com>
Change-Id: Ie45e2aa9533948b2e5235563cef1d3834494eccf
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/9739
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
diff --git a/SConscript b/SConscript
index 904d5ba..320cb2d 100644
--- a/SConscript
+++ b/SConscript
@@ -395,6 +395,7 @@
                        'src/core/CL/cl_kernels/common/instance_normalization.cl',
                        'src/core/CL/cl_kernels/common/l2_normalize.cl',
                        'src/core/CL/cl_kernels/common/mat_mul.cl',
+                       'src/core/CL/cl_kernels/common/mat_mul_mmul.cl',
                        'src/core/CL/cl_kernels/common/mat_mul_quantized.cl',
                        'src/core/CL/cl_kernels/common/mean_stddev_normalization.cl',
                        'src/core/CL/cl_kernels/common/memset.cl',