Port CLGEMM to memory injecting interface

Moves the following kernels:
 - CLGEMMMatrixMultiplyKernel
 - CLGEMMMatrixMultiplyNativeKernel
 - CLGEMMMatrixMultiplyReshapedKernel
 - CLGEMMMatrixMultiplyReshapedOnlyRHSKernel

 Moves the following functions:
 - CLGEMM

Introduces facilities to ease the handling of auxiliary temporary
buffers under the new run interface. These are:
 - CLAuxTensorHandler: allows wrapping workspace buffer memory in
 CLBuffer objects
 - Ability to inject a TensorInfo into an allocator without transferring
 ownership, which reduces copy overhead when needed.
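The handler pattern described above can be sketched with a minimal,
self-contained analogue. The types below (`TensorInfo`, `Tensor`,
`AuxTensorHandler`) are hypothetical stand-ins for illustration, not the
library's actual classes: a handler imports pre-allocated workspace memory
into a tensor for the duration of a run, and metadata is injected by
reference so neither the memory nor the descriptor changes ownership.

```cpp
#include <cstddef>

// Hypothetical stand-in for the library's tensor descriptor.
struct TensorInfo
{
    std::size_t total_size; // bytes required by the tensor
};

// Hypothetical stand-in for a tensor with non-owning bindings.
struct Tensor
{
    const TensorInfo *info   = nullptr; // injected descriptor, not owned
    void             *buffer = nullptr; // imported memory, not owned

    // Inject metadata without transferring ownership: the caller keeps
    // the TensorInfo alive, so no copy of the descriptor is made.
    void init(const TensorInfo &i)
    {
        info = &i;
    }

    // Import external workspace memory; the tensor never frees it.
    void import_memory(void *mem)
    {
        buffer = mem;
    }

    // Unbind the memory; the workspace itself stays with its owner.
    void free_memory()
    {
        buffer = nullptr;
    }
};

// RAII wrapper: binds workspace memory to a tensor for one run scope
// and unbinds it on destruction, mirroring the aux-handler idea.
class AuxTensorHandler
{
public:
    AuxTensorHandler(Tensor &t, void *workspace) : tensor_(t)
    {
        tensor_.import_memory(workspace);
    }
    ~AuxTensorHandler()
    {
        tensor_.free_memory();
    }
    Tensor &get()
    {
        return tensor_;
    }

private:
    Tensor &tensor_;
};
```

Scoping the handler to a run() call keeps the binding temporary: the
workspace can be recycled between runs while the tensor object persists.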

Resolves: COMPMID-4188

Signed-off-by: Georgios Pinitas <georgios.pinitas@arm.com>
Change-Id: I7055435d831b05b749b26302082e4ac45f26dfb0
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/5498
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Michalis Spyrou <michalis.spyrou@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
diff --git a/docs/user_guide/release_version_and_change_log.dox b/docs/user_guide/release_version_and_change_log.dox
index 0f4d4cc..a975e8b 100644
--- a/docs/user_guide/release_version_and_change_log.dox
+++ b/docs/user_guide/release_version_and_change_log.dox
@@ -280,7 +280,7 @@
    - CLLogits1DMaxShiftExpSumKernel
    - CLLogits1DNormKernel
    - CLHeightConcatenateLayerKernel
-   - @ref CLGEMMMatrixMultiplyKernel
+   - CLGEMMMatrixMultiplyKernel
    - @ref CLGEMMLowpQuantizeDownInt32ScaleKernel
    - @ref CLGEMMLowpQuantizeDownInt32ScaleByFloatKernel
    - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
@@ -567,14 +567,14 @@
       The default "axis" value for @ref NESoftmaxLayer, @ref NELogSoftmaxLayer is changed from 1 to 0.
       Only axis 0 is supported.
  - The support for quantized data types has been removed from @ref CLLogSoftmaxLayer due to implementation complexity.
- - Removed padding requirement for the input (e.g. LHS of GEMM) and output in @ref CLGEMMMatrixMultiplyNativeKernel, @ref CLGEMMMatrixMultiplyReshapedKernel, @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel and @ref CLIm2ColKernel (NHWC only)
+ - Removed padding requirement for the input (e.g. LHS of GEMM) and output in CLGEMMMatrixMultiplyNativeKernel, CLGEMMMatrixMultiplyReshapedKernel, CLGEMMMatrixMultiplyReshapedOnlyRHSKernel and @ref CLIm2ColKernel (NHWC only)
    - This change allows to use @ref CLGEMMConvolutionLayer without extra padding for the input and output.
    - Only the weights/bias of @ref CLGEMMConvolutionLayer could require padding for the computation.
-   - Only on Arm® Mali™ Midgard GPUs, @ref CLGEMMConvolutionLayer could require padding since @ref CLGEMMMatrixMultiplyKernel is called and currently requires padding.
- - Added support for exporting the OpenCL buffer object to the OpenCL image object in @ref CLGEMMMatrixMultiplyReshapedKernel and @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel.
+   - Only on Arm® Mali™ Midgard GPUs, @ref CLGEMMConvolutionLayer could require padding since CLGEMMMatrixMultiplyKernel is called and currently requires padding.
+ - Added support for exporting the OpenCL buffer object to the OpenCL image object in CLGEMMMatrixMultiplyReshapedKernel and CLGEMMMatrixMultiplyReshapedOnlyRHSKernel.
    - This support allows to export the OpenCL buffer used for the reshaped RHS matrix to the OpenCL image object.
-   - The padding requirement for the OpenCL image object is considered into the @ref CLGEMMReshapeRHSMatrixKernel.
-   - The reshaped RHS matrix stores the weights when GEMM is used to accelerate @ref CLGEMMConvolutionLayer.
+   - The padding requirement for the OpenCL image object is considered into the CLGEMMReshapeRHSMatrixKernel.
+   - The reshaped RHS matrix stores the weights when GEMM is used to accelerate CLGEMMConvolutionLayer.
 
 v20.05 Public major release
  - Various bug fixes.
@@ -739,7 +739,7 @@
  - Added QASYMM16 support for:
     - @ref CLBoundingBoxTransform
  - Added FP16 support for:
-    - @ref CLGEMMMatrixMultiplyReshapedKernel
+    - CLGEMMMatrixMultiplyReshapedKernel
  - Added new data type QASYMM8_PER_CHANNEL support for:
     - CLDequantizationLayer
     - @ref NEDequantizationLayer
@@ -749,7 +749,7 @@
     - @ref CLDepthwiseConvolutionLayer
     - @ref NEDepthwiseConvolutionLayer
  - Added FP16 mixed-precision support for:
-    - @ref CLGEMMMatrixMultiplyReshapedKernel
+    - CLGEMMMatrixMultiplyReshapedKernel
     - CLPoolingLayerKernel
  - Added FP32 and FP16 ELU activation for:
     - @ref CLActivationLayer
@@ -813,9 +813,9 @@
     - @ref CLSinLayer
     - CLBatchConcatenateLayerKernel
     - @ref CLDepthToSpaceLayerKernel / @ref CLDepthToSpaceLayer
-    - @ref CLGEMMLowpMatrixMultiplyNativeKernel
+    - CLGEMMLowpMatrixMultiplyNativeKernel
     - CLGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
-    - @ref CLGEMMMatrixMultiplyNativeKernel
+    - CLGEMMMatrixMultiplyNativeKernel
     - CLMeanStdDevNormalizationKernel /CLMeanStdDevNormalizationLayer
     - @ref CLSpaceToDepthLayerKernel / @ref CLSpaceToDepthLayer
  - New examples:
@@ -862,7 +862,7 @@
     - @ref CLFFTRadixStageKernel
     - @ref CLFFTScaleKernel
     - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
-    - @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel
+    - CLGEMMMatrixMultiplyReshapedOnlyRHSKernel
     - CLHeightConcatenateLayerKernel
     - @ref CLDirectDeconvolutionLayer
     - @ref CLFFT1D
@@ -947,9 +947,9 @@
     - @ref CLRsqrtLayer
     - @ref CLExpLayer
     - CLElementWiseUnaryLayerKernel
-    - @ref CLGEMMReshapeLHSMatrixKernel
-    - @ref CLGEMMReshapeRHSMatrixKernel
-    - @ref CLGEMMMatrixMultiplyReshapedKernel
+    - CLGEMMReshapeLHSMatrixKernel
+    - CLGEMMReshapeRHSMatrixKernel
+    - CLGEMMMatrixMultiplyReshapedKernel
     - @ref CLRangeKernel / @ref CLRange
     - @ref CLUnstack
     - @ref CLGatherKernel / @ref CLGather
@@ -1369,7 +1369,7 @@
 v17.03 Sources preview
  - New OpenCL kernels / functions:
    - CLGradientKernel, CLEdgeNonMaxSuppressionKernel, CLEdgeTraceKernel / CLCannyEdge
-   - GEMM refactoring + FP16 support: CLGEMMInterleave4x4Kernel, CLGEMMTranspose1xWKernel, @ref CLGEMMMatrixMultiplyKernel, CLGEMMMatrixAdditionKernel / @ref CLGEMM
+   - GEMM refactoring + FP16 support: CLGEMMInterleave4x4Kernel, CLGEMMTranspose1xWKernel, CLGEMMMatrixMultiplyKernel, CLGEMMMatrixAdditionKernel / @ref CLGEMM
    - CLGEMMMatrixAccumulateBiasesKernel / @ref CLFullyConnectedLayer
    - CLTransposeKernel / @ref CLTranspose
    - CLLKTrackerInitKernel, CLLKTrackerStage0Kernel, CLLKTrackerStage1Kernel, CLLKTrackerFinalizeKernel / CLOpticalFlow