Implement Fanout mode in CPPScheduler

This new scheduler mode is implemented to reduce runtime overhead on
high thread counts by distributing the scheduling work to all threads.

The fanout mode should only be enabled on high thread counts
(e.g. > 8 threads).

Alternatively the mode can be forced by setting the environment variable
ARM_COMPUTE_CPP_SCHEDULER_MODE to be either "linear" (default) or
"fanout". Note that on bare-metal this functionality is turned off but
it does not matter as only multi-threading is not supported on
bare-metal.

Resolves COMPMID-4349

Signed-off-by: SiCongLi <sicong.li@arm.com>
Change-Id: I46e2fab83ea24e616c82ae94dca7b2e72a73c7b8
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/5352
Reviewed-by: Michele Di Giorgio <michele.digiorgio@arm.com>
Reviewed-by: Georgios Pinitas <georgios.pinitas@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
diff --git a/arm_compute/runtime/CPP/CPPScheduler.h b/arm_compute/runtime/CPP/CPPScheduler.h
index f4f6a13..a5932d6 100644
--- a/arm_compute/runtime/CPP/CPPScheduler.h
+++ b/arm_compute/runtime/CPP/CPPScheduler.h
@@ -31,7 +31,14 @@
 
 namespace arm_compute
 {
-/** C++11 implementation of a pool of threads to automatically split a kernel's execution among several threads. */
+/** C++11 implementation of a pool of threads to automatically split a kernel's execution among several threads.
+ *
+ * It has 2 scheduling modes: Linear or Fanout (please refer to the implementation for details)
+ * The mode is selected automatically based on the runtime environment. However it can be forced via an environment
+ * variable ARM_COMPUTE_CPP_SCHEDULER_MODE. e.g.:
+ * ARM_COMPUTE_CPP_SCHEDULER_MODE=linear      # Force select the linear scheduling mode
+ * ARM_COMPUTE_CPP_SCHEDULER_MODE=fanout      # Force select the fanout scheduling mode
+*/
 class CPPScheduler final : public IScheduler
 {
 public: