COMPMID-3456 Update gemm tuner documentation

* Update README with the improvements
* Add a new step-by-step example section

Change-Id: I4d76821fb6c2f3b5edd54edfeff053e1c92fbb6e
Signed-off-by: SiCong Li <sicong.li@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/3713
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Sheri Zhang <sheri.zhang@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
diff --git a/docs/00_introduction.dox b/docs/00_introduction.dox
index 906ddf2..9006439 100644
--- a/docs/00_introduction.dox
+++ b/docs/00_introduction.dox
@@ -257,6 +257,11 @@
    - graph_yolov3_output_detector
  - Removed padding from:
    - @ref NEPixelWiseMultiplicationKernel
+ - GEMMTuner improvements:
+     - Added fp16 support
+     - Output json files for easier integration
+     - Enabled tuning for export_to_cl_image_rhs option for RHS tensors
+     - More robust script for running benchmarks
  - Deprecated functions / interfaces:
    - Non-descriptor based interfaces for @ref NEThreshold, @ref CLThreshold
    - In @ref NESoftmaxLayer, @ref NELogSoftmaxLayer, @ref CLSoftmaxLayer, @ref CLLogSoftmaxLayer and @ref GCSoftmaxLayer :
diff --git a/examples/gemm_tuner/README.md b/examples/gemm_tuner/README.md
index a4cde10..1effd2f 100644
--- a/examples/gemm_tuner/README.md
+++ b/examples/gemm_tuner/README.md
@@ -2,19 +2,77 @@
 
 ## Introduction
 
-This is a set of 2 script tools for tuning the performance of OpenCL GEMM kernels (limited to Convolution layer
-functions only for now).  Specifically, we tune 3 GEMM kernels, each has a different implementation **strategy** of the
-GEMM operation: **native**, **reshaped**, **reshaped only rhs**. The details of these strategies can be found in the
-documentations of the corresponding kernels: **CLGEMMMatrixMultiplyNativeKernel**,
-**CLGEMMMatrixMultiplyReshapedKernel** and **CLGEMMMatrixMultiplyReshapedOnlyRHSKernel**.
+This is a set of tools for tuning the performance of OpenCL GEMM kernels.  Specifically, we tune 3 GEMM kernels, each
+has a different implementation **strategy** of the GEMM operation: **native**, **reshaped**, **reshaped only rhs**.
+The details of these strategies can be found in the documentation of the corresponding kernels:
+**CLGEMMMatrixMultiplyNativeKernel**, **CLGEMMMatrixMultiplyReshapedKernel** and
+**CLGEMMMatrixMultiplyReshapedOnlyRHSKernel**.
 
-The outputs of the tuning process are 1 optimal configuration (called **GEMM Configuration** or **GEMMConfig**, for
-more details see Approach section) for each of the 3 strategies.
+The Tuner consists of 2 scripts and 3 binaries:
+* benchmark_gemm_examples.sh and GemmTuner.py under examples/gemm_tuner, and
+* benchmark_cl_gemm_native, benchmark_cl_gemm_reshaped_rhs_only and benchmark_cl_gemm_reshaped under
+  build/tests/gemm_tuner (you'll need to build the library first)
 
-## Location
-The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found under $ACL_ROOT/examples/gemm_tuner.
+The input to the Tuner is a list of 4-valued tuples that we call **GEMM shape** or **GEMMParam** (M, N, K, B, and possibly
+data type). They define the "shape" and other parameters (e.g. data type) of a GEMM operation:
+```
+LHS x RHS = DST
+```
+Where LHS is of shape MxK, RHS is of shape KxN and DST is of shape MxN, and B is the batch size.
 
-## Pre-requisite
+The outputs of the tuning process are 4 json files:
+1. gemm_type_selection.json: selects which kernel type is the best for each GEMMParam
+2. gemm_config_native.json: selects a list of best **GEMMConfigs** of the native kernel for each GEMMParam
+3. gemm_config_reshapedonlyrhs.json: selects a list of best GEMMConfigs of the reshaped_only_rhs kernel for each GEMMParam
+4. gemm_config_reshaped.json: selects a list of best GEMMConfigs of the reshaped kernel for each GEMMParam
+
+These 4 files are the current representation of what we call the **heuristics** of a GEMM op: given a GEMMParam,
+which kernel and, subsequently, which configurations for that kernel are the most performant.
+
+## Step-by-step example
+
+### Step1: Prepare the shape and configs files
+1. We first need to identify the shapes that we are interested in and store them in a csv file, say *gemm_shapes.csv*.
+2. Then we need to specify a set of good GEMMConfig candidates for each kernel in 3 separate csv files (this requires
+    some prior heuristics, but can be provided by the ACL developers upon request, based on your target device).
+
+   Say we have *gemm_configs_native.csv*, *gemm_configs_reshaped.csv* and *gemm_configs_reshaped_only_rhs.csv*.
+
+   Please refer to the Prerequisite section for more details.
+
+### Step2: Push relevant files to the target device
+All the files that need to be present on the target device are:
+* benchmark script: \<ACL\>/examples/gemm_tuner/benchmark_gemm_examples.sh
+* shapes and configs csv files: gemm_shapes.csv, gemm_configs_native.csv, gemm_configs_reshaped_only_rhs.csv, gemm_configs_reshaped.csv
+* Example benchmark binaries: \<ACL\>/build/tests/gemm_tuner/benchmark_cl_gemm*
+
+### Step3: Collect benchmark data
+With these files on device, we can collect benchmark data using the script. Assume all the example binaries are pushed
+to a folder called *gemm_tuner*. While logged onto our device:
+```
+# Native
+./benchmark_gemm_examples.sh -s native -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_native.csv -o results/native
+# Reshaped Only RHS
+./benchmark_gemm_examples.sh -s reshaped_rhs_only -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped_only_rhs.csv -o results/reshaped_only_rhs
+# Reshaped
+./benchmark_gemm_examples.sh -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped
+```
+You can repeat the 3 commands above to add some redundancy to your benchmark data (measurements are noisy),
+but you may need to change the output folder for each repeat.
+
+### Step4: Generate the heuristics
+1. After benchmarking, we pull the benchmark data (the *results* folder) from the target device to our host machine.
+2. We use the GemmTuner.py script to generate the heuristics:
+   ```
+   python3 <ACL>/examples/gemm_tuner/GemmTuner.py -b ./results -o heuristics
+   ```
+   When it has finished, there should be 4 json files in the *heuristics* folder.
+
+Note that the config heuristics might give more than one recommendation for each GEMMParam, because
+we accept all good GEMMConfigs within a tolerance. If you want fewer recommendations, you can decrease the tolerance by
+passing a lower value via *-t \<tolerance\>* to the GemmTuner.py script.
+
+## Prerequisite
 * A target device to be tuned, plus the following on the device:
     * Android or Linux OS
     * Bash shell
@@ -28,10 +86,7 @@
 
        The format is described as:
 
-       A headerless csv file with fields separated by commas and commas only (there cannot be whitespaces around each
-       field).
-
-       Note also comments and extraneous empty lines are not permitted.
+       A headerless csv file with fields separated by commas.
 
        A gemm shape is a list of 4 positive integers \<M, N, K, B\> describing the shapes of the two matrices (LHS and
        RHS) with:
@@ -54,10 +109,10 @@
 
       The format of the file for each strategy is the same:  
 
-      A headerless csv file with fields separated by commas and commas only (there cannot be whitespaces around each
-      field). Note also comments and extraneous empty lines are not permitted.
+      A headerless csv file with fields separated by commas.
 
       However the fields of GEMMConfig differ for each strategy:
+
       * Strategy **native**:
         A gemm config is a list of 3 positive integers \<m0, n0, k0\>, with:
 
@@ -78,9 +133,7 @@
   ...
   ```
       * Strategy **reshaped_rhs_only**:
-
-        A gemm config is a list of 4 positive integers \<m0, n0, k0, h0\> and 2 boolean values interleave_rhs and
-        transpose_rhs, with:
+        A gemm config is a list of 4 positive integers \<m0, n0, k0, h0\> and 3 boolean values:
 
         m0 - Number of rows processed by the matrix multiplication  
         n0 - Number of columns processed by the matrix multiplication  
@@ -88,6 +141,9 @@
         h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row  
         interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)  
         transpose_rhs - Transpose rhs matrix (1) / Do not transpose rhs matrix (0)  
+        export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
+                                with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
+                                for more details.
 
         Only the following configurations of M0, N0 and K0 are currently supported:
 
@@ -98,14 +154,12 @@
 
         An example gemm config file looks like:
   ```
-  4,4,4,1,1,1
-  4,4,4,3,1,0
+  4,4,4,1,1,1,0
+  4,4,4,3,1,0,1
   ...
   ```
       * Strategy **reshaped**:
-
-        A gemm config is a list of 5 positive integers \<m0, n0, k0, v0, h0\> and 3 boolean values interleave_lhs,
-        interleave_rhs and transpose_rhs, with:
+        A gemm config is a list of 5 positive integers \<m0, n0, k0, v0, h0\> and 4 boolean values:
 
         m0 - Number of rows processed by the matrix multiplication  
         n0 - Number of columns processed by the matrix multiplication  
@@ -114,29 +168,31 @@
         h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row  
         interleave_lhs - Interleave lhs matrix (1) / Do not interleave lhs matrix (0)  
         interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)  
-        transpose_rhs - Transpose rhs matrix but not lhs matrix (1) / Do not transpose rhs matrix but do transpose
-        lhs matrix (0)  
+        transpose_rhs - Transpose rhs matrix but not lhs matrix (1) / Do not transpose rhs matrix but do transpose lhs matrix (0)  
+        export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
+                                with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
+                                for more details.
 
-        * If rhs matrix is transposed only the following configurations are currently supported:
+        If rhs matrix is transposed only the following configurations are currently supported:
 
-          M0 = 2, 3, 4, 5, 6, 7, 8  
-          N0 = 2, 3, 4, 8, 16  
-          K0 = 2, 3, 4, 8, 16  
-          V0 >= 1  
-          H0 >= 1  
+        M0 = 2, 3, 4, 5, 6, 7, 8  
+        N0 = 2, 3, 4, 8, 16  
+        K0 = 2, 3, 4, 8, 16  
+        V0 >= 1  
+        H0 >= 1  
 
-        * If lhs matrix is transposed only the following configurations are currently supported:
+        If lhs matrix is transposed only the following configurations are currently supported:
 
-          M0 = 2, 3, 4, 8  
-          N0 = 2, 3, 4, 8, 16  
-          K0 = 2, 3, 4, 8, 16  
-          V0 >= 1  
-          H0 >= 1  
+        M0 = 2, 3, 4, 8  
+        N0 = 2, 3, 4, 8, 16  
+        K0 = 2, 3, 4, 8, 16  
+        V0 >= 1  
+        H0 >= 1  
 
         An example gemm config file looks like:
   ```
-  4,4,4,1,3,1,1,1
-  4,4,4,3,3,1,1,0
+  4,4,4,1,3,1,1,1,0
+  4,4,4,3,3,1,1,0,1
   ...
   ```
 * A host machine, plus these on the machine:
@@ -144,45 +200,53 @@
     * GemmTuner.py script
 
 ## Usage
-The tuning stage consists of 2 steps:
+The usage of the 2 scripts:
 
-1. Run benchmarks:
+1. benchmark_gemm_examples.sh
 
    Run the shell script (**benchmark_gemm_examples.sh**) on your **target device**. Note that all the built benchmark
-   examples have to be present on your target device prior to running. The benchmark results will be saved to json
-   files in an output directory.
+   examples (build/tests/gemm_tuner/benchmark_cl_gemm*) have to be present on your target device prior to running.
+   The benchmark results will be saved to json files in an output directory.
    ```
    Usage: benchmark_gemm_examples.sh [-h] -s \<strategy\> -e \<example_binary_dir\> -g \<gemm_shape_file\>
-   -c \<gemm_config_file\> [-o \<out_dir\>]
+   -c \<gemm_config_file\> [-d \<data_type\>] [-o \<out_dir\>]
 
    Options:
            -h
-           Print help messages. If a strategy is specified with -s \<strategy\>, then only display messages relevant
-           to that strategy. Otherwise if no strategy is specified, display messages for all available strategies.
+           Print help messages. If a strategy is specified with -s <strategy>, then only display messages relevant to that
+           strategy. Otherwise if no strategy is specified, display messages for all available strategies.
 
-           -s \<strategy\>
+           -s <strategy>
            Strategy option.
-           Options: native reshaped_rhs_only reshaped.
+           Options: native reshaped_rhs_only reshaped.
 
-           -e \<example_binary_dir\>
+           -e <example_binary_dir>
            Path to directory that holds all example binaries
 
-           -g \<gemm_shape_file\>
+           -g <gemm_shape_file>
            Path to gemm shape csv file
 
-           -c \<gemm_config_file\>
+           -c <gemm_config_file>
            Path to gemm config csv file
 
-           -o \<out_dir\>
+           -d <data_type>
+           Data type option with which to run benchmark examples
+           Default: ${DEFAULT_DATA_TYPE}
+           Supported options:
+           Strategy            :    Data Types
+           Native              :    F32
+           Reshaped            :    F16, F32
+           Reshaped RHS Only   :    F16, F32
+
+           -o <out_dir>
            Path to output directory that holds output json files
-           Default: out
+           Default: out
    ```
-2. Run analyser:
+2. GemmTuner.py
 
   Run the python script (**GemmTuner.py**) on your **host machine**.
   You'll need to transfer all the benchmark result json files generated from the previous step to your host machine
-  beforehand. The script will output the best configuration, along with some analysis statistics for each strategy, and
-  optionally save the parsed benchmark results into csv files (one for each strategy) for further analysis.
+  beforehand. The script will output the best kernel and GEMMConfigs for each GEMMParam in the 4 output json files.
    ```
    Usage: GemmTuner.py [-h] -b PATH [-o PATH] [-t TOLERANCE] [-D]
 
@@ -194,40 +258,11 @@
                            result json files have a file extension of
                            'gemmtuner_benchmark'
      -o PATH, --output_dir PATH
-                           Path to directory that holds output csv files. One per
-                           strategy
+                           Path to directory that holds output json files.
      -t TOLERANCE, --tolerance TOLERANCE
                            For testing if two GEMMConfigs are equivalent in terms
                            of performance. The tolerance is OpenCL timer in
                            milliseconds. Recommended value: <= 0.1 ms
      -D, --debug           Enable script debugging output
 
-   ```
-
-## Approach
-
-This section gives a brief description and rationale of the approach adopted by the current version of GEMM Tuner.
-
-As explained in the Introduction section, the outputs of the tuner are 1 optimal GEMMConfig for each strategy.
-This is because we can only integrate 1 GEMMConfig for each strategy in ACL at compile time. In theory, however, the
-optimal GEMMConfig also depends on different parameters of GEMM (called GEMM Parameter or GEMMParam, e.g.: the shape
-of the operation); thus ideally, for each strategy, the optimal configurations should be a mapping from GEMMParam to
-GEMMConfig instead of a single GEMMConfig.
-
-To address this issue, we ensure the one single optimal GEMMConfig can generalise well to all potential GEMMParams
-(or at least the ones that we care about). The approach we adopt involves a preliminary stage where a collection of
-common GEMMParams (GEMM shapes from popular networks) are compiled. Then, to reduce the final tuning time, rather
-contradictorily, we spend a lot of time searching for near-optimal GEMMConfigs for each GEMMParam first, and then
-discard redundant GEMMParams which share similar optimal GEMMConfigs with others. The resultant list of GEMMParams is
-called a __GEMMParam search list__, as in these GEMMParams are typical enough to capture the space of GEMMParams that
-we care about.
-
-During this preliminary stage we also produce a list of good GEMMConfigs that can be used to search for the optimal one
-in the actual tuning stage. This, again, is to reduce the tuning time, and the resultant list is called a
-__GEMMConfig search list__.
-
-The GEMMParam search list and the GEMMConfig search list are investigated and prepared by the developers; the users of
-GEMM tuner need not worry about producing them, but they need to obtain them prior to running the tuner.
-
-Once these two lists (2 for each strategy, so 6 in total) are obtained, they can be fed to the tuner, to produce the
-optimal GEMMConfig(s).
\ No newline at end of file
+   ```
\ No newline at end of file
diff --git a/examples/gemm_tuner/benchmark_gemm_examples.sh b/examples/gemm_tuner/benchmark_gemm_examples.sh
index bb9ec0f..f764cfa 100755
--- a/examples/gemm_tuner/benchmark_gemm_examples.sh
+++ b/examples/gemm_tuner/benchmark_gemm_examples.sh
@@ -59,10 +59,7 @@
 function help_gemm_shape_file() {
   cat >&2 << EOF
 Gemm shape file:
-  Gemm shape file is a headerless csv file with fields separated by commas and commas only (there cannot be whitespaces
-  around each field).
-
-  Note also comments and extraneous empty lines are not permitted.
+  Gemm shape file is a headerless csv file with fields separated by commas.
 
   A gemm shape is a list of 4 positive integers <M, N, K, B> describing the shapes of the two matrices (LHS and RHS)
   with:
@@ -91,10 +88,7 @@
 function help_gemm_config_file_native() {
   cat >&2 << EOF
 Gemm config file (Strategy native):
-  Gemm config file is a headerless csv file with fields separated by commas and commas only (there cannot be whitespaces
-  around each field).
-
-  Note also comments and extraneous empty lines are not permitted.
+  Gemm config file is a headerless csv file with fields separated by commas.
 
   A gemm config is a list of 3 positive integers <m0, n0, k0>, with:
   m0 - Number of rows processed by the matrix multiplication
@@ -126,19 +120,20 @@
 function help_gemm_config_file_reshaped_rhs_only() {
   cat >&2 << EOF
 Gemm config file (Strategy reshaped_rhs_only):
-  Gemm config file is a headerless csv file with fields separated by commas and commas only (there cannot be whitespaces
-  around each field).
+  Gemm config file is a headerless csv file with fields separated by commas.
 
   Note also comments and extraneous empty lines are not permitted.
 
-  A gemm config is a list of 4 positive integers <m0, n0, k0, h0> and 2 boolean values interleave_rhs and transpose_rhs, with:
+  A gemm config is a list of 4 positive integers <m0, n0, k0, h0> and 3 boolean values:
   m0 - Number of rows processed by the matrix multiplication
   n0 - Number of columns processed by the matrix multiplication
   k0 - Number of partial accumulations performed by the matrix multiplication
   h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
   interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
   transpose_rhs - Transpose rhs matrix (1) / Do not transpose rhs matrix (0)
-  export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0)
+  export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
+                           with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
+                           for more details.
 
   Only the following configurations of M0, N0 and K0 are currently supported:
   M0 = 1, 2, 3, 4, 5, 6, 7, 8
@@ -166,12 +161,9 @@
 function help_gemm_config_file_reshaped() {
   cat >&2 << EOF
 Gemm config file (Strategy reshaped):
-  Gemm config file is a headerless csv file with fields separated by commas and commas only (there cannot be whitespaces
-  around each field).
+  Gemm config file is a headerless csv file with fields separated by commas.
 
-  Note also comments and extraneous empty lines are not permitted.
-
-  A gemm config is a list of 5 positive integers <m0, n0, k0, v0, h0> and 3 boolean values interleave_lhs, interleave_rhs and transpose_rhs, with:
+  A gemm config is a list of 5 positive integers <m0, n0, k0, v0, h0> and 4 boolean values:
   m0 - Number of rows processed by the matrix multiplication
   n0 - Number of columns processed by the matrix multiplication
   k0 - Number of partial accumulations performed by the matrix multiplication
@@ -180,7 +172,9 @@
   interleave_lhs - Interleave lhs matrix (1) / Do not interleave lhs matrix (0)
   interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
   transpose_rhs - Transpose rhs matrix but not lhs matrix (1) / Do not transpose rhs matrix but do transpose lhs matrix (0)
-  export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0)
+  export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
+                           with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
+                           for more details.
 
   If rhs matrix is transposed only the following configurations are currently supported:
   M0 = 2, 3, 4, 5, 6, 7, 8
@@ -218,7 +212,7 @@
 Run gemm examples of a selected strategy, over provided tunable configurationsa and gemm shapes.
 Save the benchmark results to json files in an output directory.
 
-Usage: ${CMD} [-h] -s <strategy> -e <example_binary_dir> -g <gemm_shape_file> -c <gemm_config_file> [-o <out_dir>]
+Usage: ${CMD} [-h] -s <strategy> -e <example_binary_dir> -g <gemm_shape_file> -c <gemm_config_file> [-d <data_type>] [-o <out_dir>]
 
 Options:
         -h