Fix parallel depthwise perf regression from 2db938c

An inverted conditional meant that we were parallelizing over batches
(Window::DimW) whenever there was more than one row, when we should have
been parallelizing over rows (Window::DimZ).
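
For illustration only, the intended selection can be sketched in
isolation as below. SplitDim, choose_split and choose_split_before_fix
are hypothetical stand-ins for Window::DimZ / Window::DimW and the
ternary in the dispatch code; they are not part of the library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for Window::DimZ (rows) and Window::DimW (batches).
enum class SplitDim { Rows, Batches };

// Intended behaviour: split over rows if there is more than one,
// otherwise fall back to splitting over batches.
constexpr SplitDim choose_split(std::size_t num_rows)
{
    return num_rows != 1 ? SplitDim::Rows : SplitDim::Batches;
}

// The comparison as it stood before this change, kept for contrast:
// it picks rows only when there is exactly one row to split.
constexpr SplitDim choose_split_before_fix(std::size_t num_rows)
{
    return num_rows == 1 ? SplitDim::Rows : SplitDim::Batches;
}

// Small self-check contrasting the two conditions.
inline void check_split_choice()
{
    assert(choose_split(56) == SplitDim::Rows);
    assert(choose_split(1) == SplitDim::Batches);
    assert(choose_split_before_fix(56) == SplitDim::Batches);
}
```

With a single batch and many rows, the pre-fix condition leaves only
one unit of work along the chosen dimension, which would be consistent
with the regression described here.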

Relates to: ONCPUML-1443 COMPMID-6875

Signed-off-by: Jonathan Deakin <jonathan.deakin@arm.com>
Change-Id: I61d43bb2a94e8a6887d4cc5d1ae2ebb03295dff7
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11120
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
diff --git a/src/cpu/operators/CpuDepthwiseConv2dAssemblyDispatch.cpp b/src/cpu/operators/CpuDepthwiseConv2dAssemblyDispatch.cpp
index 38092ad..7fe9011 100644
--- a/src/cpu/operators/CpuDepthwiseConv2dAssemblyDispatch.cpp
+++ b/src/cpu/operators/CpuDepthwiseConv2dAssemblyDispatch.cpp
@@ -110,7 +110,7 @@
 
     // Split over rows (z) if there's more than 1, otherwise batches (w). This logic
     // corresponds to the threading strategy in DepthFirstDriver::execute_internal
-    auto split_dimension = _pImpl->asm_kernel->window().num_iterations(Window::DimZ) == 1 ? Window::DimZ : Window::DimW;
+    auto split_dimension = _pImpl->asm_kernel->window().num_iterations(Window::DimZ) != 1 ? Window::DimZ : Window::DimW;
 
     NEScheduler::get().schedule_op(_pImpl->asm_kernel.get(), split_dimension, _pImpl->asm_kernel->window(), tensors);
 }