[ONCPUML-1451] Add matmul kernel to enable bf16-to-bf16 operations via the PyTorch® autocast() function

The full range of tests will be added under the [MLINFSW-482] epic, due to the current lack of reordering kernels implemented in ACL.
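For context, a minimal sketch of the PyTorch-side usage this kernel targets (not part of this change; it assumes a PyTorch build whose CPU backend dispatches bf16 GEMMs through ACL on aarch64, and the tensor shapes are arbitrary):

    import torch

    a = torch.randn(128, 256)  # fp32 inputs
    b = torch.randn(256, 64)

    # Under CPU autocast with bfloat16, autocast-eligible ops such as
    # torch.matmul run in bf16, which is where a bf16-to-bf16 GEMM
    # kernel can be picked up by the backend.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        c = torch.matmul(a, b)

    print(c.dtype)  # torch.bfloat16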

Co-Authored-By: David Mansell <David.Mansell@arm.com>
Change-Id: I820d316295a1ec94fdc89c37e4144a268f914c36
Signed-off-by: Renato Arantes <renato.arantes@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11169
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
diff --git a/Android.bp b/Android.bp
index 0d087c9..d216c67 100644
--- a/Android.bp
+++ b/Android.bp
@@ -324,6 +324,7 @@
         "src/core/NEON/kernels/arm_conv/pooling/pooling_u8.cpp",
         "src/core/NEON/kernels/arm_conv/pooling/pooling_u8q.cpp",
         "src/core/NEON/kernels/arm_gemm/gemm_bf16.cpp",
+        "src/core/NEON/kernels/arm_gemm/gemm_bf16bf16.cpp",
         "src/core/NEON/kernels/arm_gemm/gemm_fp16.cpp",
         "src/core/NEON/kernels/arm_gemm/gemm_fp32.cpp",
         "src/core/NEON/kernels/arm_gemm/gemm_int16.cpp",