Optimize CpuScale NHWC F32/F16

- Rework CpuScaleKernel F32/F16 NHWC - bilinear
- Rework CpuScaleKernel F32/F16 NHWC - nearest
- Add test to validate the vector computation path

Resolves COMPMID-4801, COMPMID-4802

Change-Id: Ie6e4f262a8cce509edd7b8f564c940758625c58a
Signed-off-by: Gian Marco Iodice <gianmarco.iodice@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/6361
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com>
diff --git a/src/cpu/kernels/CpuScaleKernel.cpp b/src/cpu/kernels/CpuScaleKernel.cpp
index 1108c7a..3063d8f 100644
--- a/src/cpu/kernels/CpuScaleKernel.cpp
+++ b/src/cpu/kernels/CpuScaleKernel.cpp
@@ -123,12 +123,12 @@
     {
         "neon_u8_scale",
         [](const ScaleSelectorData & data) { return data.dt == DataType::U8; },
-        REGISTER_INTEGER_NEON(arm_compute::cpu::common_neon_scale<uint8_t>)
+        REGISTER_INTEGER_NEON(arm_compute::cpu::u8_neon_scale)
     },
     {
         "neon_s16_scale",
         [](const ScaleSelectorData & data) { return data.dt == DataType::S16; },
-        REGISTER_INTEGER_NEON(arm_compute::cpu::common_neon_scale<int16_t>)
+        REGISTER_INTEGER_NEON(arm_compute::cpu::s16_neon_scale)
     },
 #endif /* defined(ARM_COMPUTE_ENABLE_NEON) */
 };