Optimize CpuSoftmaxKernel for axis=0

Implement a single kernel instead of two consecutive ones. In the previous setup, one kernel computed the maximum value along the axis, and the second kernel subtracted this maximum from each input element while computing the softmax, i.e.

softmax(x_i) = exp(x_i - max) / sum_j( exp(x_j - max) )
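
For example, for x = (1, 2, 3), max = 3, the shifted exponentials are
(e^-2, e^-1, e^0) ≈ (0.135, 0.368, 1.000), their sum is ≈ 1.503, and
softmax(x) ≈ (0.090, 0.245, 0.665).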

This patch fuses these two stages into a single Neon™ kernel for all data types. This saves memory because the max values no longer need to be held in a separate auxiliary tensor.

It also introduces further optimizations that ease memory pressure when the data type is float/half, by using the dst tensor as temporary storage for the already-exponentiated inputs.
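
The following is a minimal scalar sketch of the fused, three-pass approach (hypothetical code for illustration only; the actual kernel is vectorized with Neon intrinsics, operates over windows, and handles quantized types differently):

#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical scalar model of the fused kernel over one row of length len.
template <typename T, bool IS_LOG>
void fused_softmax_1d(const T *src, T *dst, size_t len, float beta)
{
    // Pass 1: row maximum (previously a separate kernel writing into an
    // auxiliary max tensor).
    float max_val = static_cast<float>(src[0]);
    for (size_t i = 1; i < len; ++i)
    {
        max_val = std::max(max_val, static_cast<float>(src[i]));
    }

    // Pass 2: accumulate the sum, reusing dst as scratch storage for the
    // shifted (log case) or exponentiated values.
    float sum = 0.f;
    for (size_t i = 0; i < len; ++i)
    {
        const float shifted = (static_cast<float>(src[i]) - max_val) * beta;
        sum += std::exp(shifted);
        dst[i] = static_cast<T>(IS_LOG ? shifted : std::exp(shifted));
    }

    // Pass 3: normalize dst in place.
    const float norm = IS_LOG ? std::log(sum) : 1.f / sum;
    for (size_t i = 0; i < len; ++i)
    {
        dst[i] = static_cast<T>(IS_LOG ? static_cast<float>(dst[i]) - norm
                                       : static_cast<float>(dst[i]) * norm);
    }
}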

It removes the references to the SVE and SVE2 implementations and most of the associated files, but keeps the implementations themselves, as they may be used in the future.

Resolves: COMPMID-6500

Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Change-Id: Icff9976d1214c4c6cbe15a62ca60b8a77d3784cc
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10688
Reviewed-by: SiCong Li <sicong.li@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
diff --git a/src/cpu/kernels/softmax/generic/neon/fp16.cpp b/src/cpu/kernels/softmax/generic/neon/fp16.cpp
index 2e2adf3..db8f881 100644
--- a/src/cpu/kernels/softmax/generic/neon/fp16.cpp
+++ b/src/cpu/kernels/softmax/generic/neon/fp16.cpp
@@ -31,21 +31,18 @@
 {
 namespace cpu
 {
-void neon_fp16_softmax(const ITensor *in,
-                       const ITensor *max,
-                       void *const    tmp,
-                       ITensor       *out,
-                       const float    beta,
-                       bool           is_log,
-                       const Window  &window)
+
+template <bool IS_LOG>
+void neon_fp16_softmax(const ITensor *in, void *const tmp, ITensor *out, const float beta, const Window &window)
 {
-    return neon_softmax_logits_1d_float<float16_t>(in, max, tmp, out, beta, is_log, window);
+    return neon_softmax_float<float16_t, IS_LOG>(in, tmp, out, beta, window);
 }
 
-void neon_fp16_logits(const ITensor *in, ITensor *out, const Window &window)
-{
-    return neon_logits_1d_max<float16_t>(in, out, window);
-}
+template void
+neon_fp16_softmax<true>(const ITensor *in, void *const tmp, ITensor *out, const float beta, const Window &window);
+template void
+neon_fp16_softmax<false>(const ITensor *in, void *const tmp, ITensor *out, const float beta, const Window &window);
+
 } // namespace cpu
 } // namespace arm_compute
 #endif //defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
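
Making IS_LOG a template parameter (rather than the former is_log runtime argument) lets the compiler generate a specialized kernel for each variant, with the two explicit instantiations above keeping both available to callers. A hypothetical call site would pick the specialization from the runtime flag:

if (is_log)
{
    neon_fp16_softmax<true>(in, tmp, out, beta, window);
}
else
{
    neon_fp16_softmax<false>(in, tmp, out, beta, window);
}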