COMPMID-2418: CLDequantizationLayer support for QASYMM8_PER_CHANNEL

Add support for QASYMM8_PER_CHANNEL in CLDequantizationLayer.
Add tests for the NHWC data layout and update the NEON code to
work with it.
Clean up the reference implementation.

Change-Id: Ic1d51f16f7f625503fffdbbb66f6487aa588f08c
Signed-off-by: Michalis Spyrou <michalis.spyrou@arm.com>
Reviewed-on: https://review.mlplatform.org/c/1828
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Georgios Pinitas <georgios.pinitas@arm.com>
diff --git a/arm_compute/runtime/CL/CLTensorAllocator.h b/arm_compute/runtime/CL/CLTensorAllocator.h
index 982cc51..f7800d3 100644
--- a/arm_compute/runtime/CL/CLTensorAllocator.h
+++ b/arm_compute/runtime/CL/CLTensorAllocator.h
@@ -146,8 +146,8 @@
     CLMemory       _memory;                  /**< OpenCL memory */
     uint8_t       *_mapping;                 /**< Pointer to the CPU mapping of the OpenCL buffer. */
     CLTensor      *_owner;                   /**< Owner of the allocator */
-    CLFloatArray   _scale;
-    CLInt32Array   _offset;
+    CLFloatArray   _scale;                   /**< Scales array in case of quantized per channel data type */
+    CLInt32Array   _offset;                  /**< Offsets array in case of quantized per channel data type */
 };
 } // namespace arm_compute
 #endif /* __ARM_COMPUTE_CLTENSORALLOCATOR_H__ */