Make memset/copy functions stateless

Port the following functions:
- CLCopy
- CLFill
- CLPermute
- CLReshapeLayer
- CLCropResize

Resolves: COMPMID-4002

Signed-off-by: Sheri Zhang <sheri.zhang@arm.com>
Change-Id: I8392aa515aaeb5b44dab6122be6a795d08376d5f
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/5003
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Michele Di Giorgio <michele.digiorgio@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
diff --git a/docs/00_introduction.dox b/docs/00_introduction.dox
index ab2495d..2f23996 100644
--- a/docs/00_introduction.dox
+++ b/docs/00_introduction.dox
@@ -191,7 +191,7 @@
    - @ref CLFuseBatchNormalizationKernel
    - @ref CLDepthwiseConvolutionLayerNativeKernel
    - @ref CLDepthConvertLayerKernel
-   - @ref CLCopyKernel
+   - CLCopyKernel
    - @ref CLDepthwiseConvolutionLayer3x3NHWCKernel
    - CLActivationLayerKernel
    - @ref CLWinogradFilterTransformKernel
@@ -434,7 +434,7 @@
    - @ref CLArgMinMaxLayerKernel
  - Added new data type U8 support for:
    - @ref NECropKernel
-   - @ref CLCropKernel
+   - CLCropKernel
  - Added aligh_corner support for nearest neighbor interpolation in:
    - @ref NEScaleKernel
    - @ref CLScaleKernel
@@ -777,7 +777,7 @@
     - @ref NEFFTConvolutionLayer
  - New OpenCL kernels / functions:
     - @ref CLComplexPixelWiseMultiplicationKernel / @ref CLComplexPixelWiseMultiplication
-    - @ref CLCropKernel / @ref CLCropResize
+    - CLCropKernel / @ref CLCropResize
     - @ref CLDeconvolutionReshapeOutputKernel
     - @ref CLFFTDigitReverseKernel
     - @ref CLFFTRadixStageKernel
@@ -1019,7 +1019,7 @@
  - New OpenCL kernels / functions:
     - @ref CLChannelShuffleLayer / @ref CLChannelShuffleLayerKernel
     - @ref CLConvertFullyConnectedWeightsKernel / @ref CLConvertFullyConnectedWeights
-    - @ref CLCopy / @ref CLCopyKernel
+    - @ref CLCopy / CLCopyKernel
     - @ref CLLSTMLayer
     - @ref CLRNNLayer
     - CLWidthConcatenateLayer / CLWidthConcatenateLayerKernel
@@ -1103,7 +1103,7 @@
  - Various bug fixes
  - Added some of the missing validate() methods
  - Added @ref CLDeconvolutionLayerUpsampleKernel / @ref CLDeconvolutionLayer @ref CLDeconvolutionLayerUpsample
- - Added @ref CLPermuteKernel / @ref CLPermute
+ - Added CLPermuteKernel / @ref CLPermute
  - Added method to clean the programs cache in the CL Kernel library.
  - Added @ref GCArithmeticAdditionKernel / @ref GCArithmeticAddition
  - Added @ref GCDepthwiseConvolutionLayer3x3Kernel / @ref GCDepthwiseConvolutionLayer3x3
@@ -1218,7 +1218,7 @@
     - @ref CLQuantizationLayerKernel @ref CLMinMaxLayerKernel / @ref CLQuantizationLayer
     - @ref CLROIPoolingLayerKernel / @ref CLROIPoolingLayer
     - @ref CLReductionOperationKernel / @ref CLReductionOperation
-    - @ref CLReshapeLayerKernel / @ref CLReshapeLayer
+    - CLReshapeLayerKernel / @ref CLReshapeLayer
 
 v17.06 Public major release
  - Various bug fixes
diff --git a/docs/04_adding_operator.dox b/docs/04_adding_operator.dox
index 9e6f375..f311fb4 100644
--- a/docs/04_adding_operator.dox
+++ b/docs/04_adding_operator.dox
@@ -117,7 +117,7 @@
 
 The structure of the kernel .cpp file should be similar to the next ones.
 For OpenCL:
-@snippet src/core/CL/kernels/CLReshapeLayerKernel.cpp CLReshapeLayerKernel Kernel
+@snippet src/core/gpu/cl/kernels/ClReshapeKernel.cpp ClReshapeKernel Kernel
 The run will call the function defined in the .cl file.
 
 For the NEON backend case: