Make Sub kernel and operator stateless

- Rename NEArithmeticSubstractionKernel to CpuSubKernel and move files appropriately

- Add CpuSub under src/runtime/cpu/operators

Partially resolves: COMPMID-4007

Signed-off-by: Sheri Zhang <sheri.zhang@arm.com>
Change-Id: I4754ca9101d82dccacca744be6d069764a9c6b55
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/4868
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
diff --git a/arm_compute/runtime/NEON/functions/NEQLSTMLayer.h b/arm_compute/runtime/NEON/functions/NEQLSTMLayer.h
index 34f51d3..743a32c 100644
--- a/arm_compute/runtime/NEON/functions/NEQLSTMLayer.h
+++ b/arm_compute/runtime/NEON/functions/NEQLSTMLayer.h
@@ -51,7 +51,7 @@
  *
  * -# @ref NEActivationLayer                                     Activation functions (tanh and logistic)
  * -# @ref NEArithmeticAddition                                  Elementwise addition
- * -# @ref NEArithmeticSubtractionKernel                         Elementwise subtraction
+ * -# @ref NEArithmeticSubtraction                               Elementwise subtraction
  * -# @ref NECopy                                                Copy kernel for copying output_state_out to output
  * -# @ref NEGEMMLowpMatrixMultiplyCore                          Quantized matrix multiplication core. Accumulators are 32-bit integers
  * -# @ref NEGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPoint   Convert 32-bit integers into QSYMM16