[ONCPUML-1451] Add matmul kernel to enable bf16 to bf16 operations via PyTorch® autocast() function

The full range of tests will be added as part of the [MLINFSW-482] epic, due to the lack of reordering kernels implemented in ACL.
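For context, a minimal sketch of the PyTorch usage this kernel targets (illustrative only, not part of this patch): under CPU autocast with `dtype=torch.bfloat16`, matmul on float32 inputs is dispatched in bfloat16, which is what exercises a bf16 matmul kernel in the backend.

```python
import torch

# Illustrative sketch: CPU autocast casts matmul operands to bfloat16,
# so the backend's bf16 matmul path is taken.
a = torch.randn(8, 8)
b = torch.randn(8, 8)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    c = torch.matmul(a, b)
# The autocast result is produced in bfloat16.
```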

Co-Authored-By: David Mansell <David.Mansell@arm.com>
Change-Id: I820d316295a1ec94fdc89c37e4144a268f914c36
Signed-off-by: Renato Arantes <renato.arantes@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11169
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
diff --git a/docs/user_guide/operator_list.dox b/docs/user_guide/operator_list.dox
index 25c856d..36275e6 100644
--- a/docs/user_guide/operator_list.dox
+++ b/docs/user_guide/operator_list.dox
@@ -1,5 +1,5 @@
 ///
-/// Copyright (c) 2021-2023 Arm Limited.
+/// Copyright (c) 2021-2023,2024 Arm Limited.
 ///
 /// SPDX-License-Identifier: MIT
 ///
@@ -2091,6 +2091,7 @@
     <tr><th>lhs<th>rhs<th>dst
     <tr><td>F32<td>F32<td>F32
     <tr><td>F16<td>F16<td>F16
+    <tr><td>BFLOAT16<td>BFLOAT16<td>BFLOAT16
     <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
     <tr><td>QASYMM8<td>QASYMM8<td>QASYMM8
     </table>
diff --git a/docs/user_guide/release_version_and_change_log.dox b/docs/user_guide/release_version_and_change_log.dox
index 2d46737..31b7560 100644
--- a/docs/user_guide/release_version_and_change_log.dox
+++ b/docs/user_guide/release_version_and_change_log.dox
@@ -42,10 +42,11 @@
 @section S2_2_changelog Changelog
 
 v24.04 Public major release
+ - Add Bfloat16 data type support for @ref NEMatMul.
  - Optimize start-up time of @ref NEConvolutionLayer for some input configurations where GeMM is selected as the convolution algorithm
  - Optimize @ref NEConvolutionLayer for input tensor size > 1e7 bytes and weight tensor height > 7
  - Performance optimizations:
-  - Optimize @ref NESoftmaxLayer for axis != 0 by natively supporting higher axes up to axis 3.
+   - Optimize @ref NESoftmaxLayer for axis != 0 by natively supporting higher axes up to axis 3.
 
 v24.02.1 Public patch release
  - Fix performance regression in fixed-format kernels