Implement MatMul Function and Operator with Floating Point support for CPU

- Implements the MatMul function and operator for the floating point data types FP16/FP32 (a short usage sketch follows this list)
- Includes support for transposing dynamic tensors prior to matrix multiplication
- Adds tests for 2D/3D/4D+ tensors in MatMul with the F32/F16 data types (covering all combinations of transposed/not-transposed tensors)
- Updates the fixture to allow testing fused activation in MatMul
- Adds tests for MatMul with and without fused activation
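
A rough usage sketch of the new CPU operator (illustrative only: the class and helper names
NEMatMul, MatMulInfo and CpuMatMulSettings, the header paths and the exact configure()
signature are assumptions based on the library's conventions, not details taken from this patch):

    #include "arm_compute/runtime/NEON/functions/NEMatMul.h"
    #include "arm_compute/runtime/Tensor.h"

    using namespace arm_compute;

    int main()
    {
        // lhs: [K=3, M=4], rhs: [N=5, K=3], dst: [N=5, M=4], all F32
        Tensor lhs, rhs, dst;
        lhs.allocator()->init(TensorInfo(TensorShape(3U, 4U), 1, DataType::F32));
        rhs.allocator()->init(TensorInfo(TensorShape(5U, 3U), 1, DataType::F32));
        dst.allocator()->init(TensorInfo(TensorShape(5U, 4U), 1, DataType::F32));

        // Default MatMulInfo/CpuMatMulSettings: no transposed operands, no fused activation
        NEMatMul matmul;
        matmul.configure(&lhs, &rhs, &dst, MatMulInfo(), CpuMatMulSettings());

        lhs.allocator()->allocate();
        rhs.allocator()->allocate();
        dst.allocator()->allocate();
        matmul.run();
        return 0;
    }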

Resolved: [COMPMID-5898]
Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com>
Change-Id: Iefa84b26dd723c9a51e6c3f91023152c6c31ace2
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/9411
Reviewed-by: SiCong Li <sicong.li@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
diff --git a/src/cpu/operators/internal/CpuGemmAssemblyDispatch.h b/src/cpu/operators/internal/CpuGemmAssemblyDispatch.h
index 0c51c92..588c452 100644
--- a/src/cpu/operators/internal/CpuGemmAssemblyDispatch.h
+++ b/src/cpu/operators/internal/CpuGemmAssemblyDispatch.h
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2018-2022 Arm Limited.
+ * Copyright (c) 2018-2023 Arm Limited.
  *
  * SPDX-License-Identifier: MIT
  *
@@ -82,6 +82,38 @@
 public:
     /** If supported create a Compute Library function else fallback to the arm_gemm function.
      *
+     * @note Configuring "batches"
+     * The shapes of @p a, @p b and @p d are arranged as follows:
+     *     Lowest dimension <-> Highest dimension
+     * a: [K, M, Batch, Multi]
+     * b: [N, K, Multi]
+     * d: [N, M, Batch, Multi]
+     *
+     * The "Batch" refers to where "Batch" number of MxK slices of tensor a multiplies with a single KxN slice of b
+     * The "Multi" refers to where "Multi" number of individual multiplication of a with b
+     *
+     * The following are some example input shape configurations:
+     *
+     * (1) Normal 2D gemm
+     * a: [K=3, M=4]
+     * b: [N=5, K=3]
+     * d: [N=5, M=4]
+     *
+     * (2) Batches of a sharing b (e.g. gemm-based batched convolution where b is shared across the batches)
+     * a: [K=3, M=4, Batch=9]
+     * b: [N=5, K=3]
+     * d: [N=5, M=4, Batch=9]
+     *
+     * (3) "Batches" of independent gemm (e.g. batched matmul)
+     * a: [K=3, M=4, Batch=1, Multi=7]
+     * b: [N=5, K=3, Multi=7]
+     * d: [N=5, M=4, Batch=1, Multi=7]
+     *
+     * (4) "Batches" of independent gemm where b is also shared
+     * a: [K=3, M=4, Batch=4, Multi=7]
+     * b: [N=5, K=3, Multi=7]
+     * d: [N=5, M=4, Batch=4, Multi=7]
+     *
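+     * As a rough illustration (assuming the usual lowest-dimension-first @ref TensorShape layout),
+     * configuration (3) above could be described with:
+     * @code
+     * TensorInfo a_info(TensorShape(3U, 4U, 1U, 7U), 1, DataType::F32); // a: [K=3, M=4, Batch=1, Multi=7]
+     * TensorInfo b_info(TensorShape(5U, 3U, 7U), 1, DataType::F32);     // b: [N=5, K=3, Multi=7]
+     * TensorInfo d_info(TensorShape(5U, 4U, 1U, 7U), 1, DataType::F32); // d: [N=5, M=4, Batch=1, Multi=7]
+     * @endcode
+     *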
      * @param[in]  a    Input tensor (Matrix A)
      * @param[in]  b    Input tensor (Matrix B)
      * @param[in]  c    Input tensor (Matrix C) used to pass the bias for quantized calculations