///
/// Copyright (c) 2023 Arm Limited.
///
/// SPDX-License-Identifier: MIT
///
/// Permission is hereby granted, free of charge, to any person obtaining a copy
/// of this software and associated documentation files (the "Software"), to
/// deal in the Software without restriction, including without limitation the
/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
/// sell copies of the Software, and to permit persons to whom the Software is
/// furnished to do so, subject to the following conditions:
///
/// The above copyright notice and this permission notice shall be included in all
/// copies or substantial portions of the Software.
///
/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
/// SOFTWARE.
///

namespace arm_compute
{
/**
@page conv2d_heuristic Convolution 2D heuristic

@section conv2d_heuristic_algorithms_used Convolution 2D heuristic: algorithm selection

Convolution 2D (conv2D for short) is one of the most compute-intensive and performance-critical operators in ML workloads.
This operator can be implemented with different algorithms, which differ in accuracy, kernel size support, and additional memory required.
Unfortunately, no single algorithm achieves the best performance in all scenarios.
Therefore, the Arm Compute Library integrates a heuristic within the conv2d operators to select the most efficient algorithm, depending on the input and kernel shapes and the desired level of accuracy.
The heuristic depends on the target backend (either NEON™ for Arm® CPUs or OpenCL for Arm® GPUs). The following subsections provide the main details behind the selection of the algorithm.

⚠ Attention: The heuristics presented in the following subsections refer only to the NHWC data layout, which is the optimal and recommended layout for the Arm Compute Library.

@subsection conv2d_heuristic_on_cpu Convolution 2D heuristic: Arm® Cortex®-based CPUs

The conv2d heuristic for Arm® Cortex®-based CPUs is inside the get_convolution_method() method in the CpuConv2d function.
The algorithms used in the get_convolution_method() method are the following:
- Direct-Conv2D
- Im2Col+GeMM-based
- Indirect-GeMM (a.k.a. GEMMCONV2D)
- GeMM
- Winograd

⚠ Attention: Winograd only works with floating-point data types (F32, F16)

The heuristic first checks the less frequent cases that can occur in ML workloads for edge devices. These cases are the following:
-# Non-unit dilation: We call Im2Col+GeMM
-# Large input and kernel shapes: We call Direct-Conv2D because it is the only algorithm that does not require additional temporary memory
-# Small Input-Feature-Maps (IFM): In this scenario, we have found that the GeMM implementation is generally more efficient than Winograd and Indirect-GeMM

For the more frequent cases, such as unit dilation or a larger IFM, we evaluate the following conditions instead:
-# Unit kernel size (1x1): In this scenario, the conv2d operation corresponds to a matrix multiplication and we call GeMM.
-# Winograd: Winograd only works with unit strides and supports a limited number of kernel sizes, such as 3x3, 3x1, 1x3, 5x1, 1x5 and 5x5
-# Indirect-GeMM: It should be used in all cases except when the kernel size is 1x1 or when the IFM is small

If the preceding cases are not met, we fall back to the Im2Col+GeMM-based algorithm.
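
The order of the checks described above can be sketched as a small decision function. This is only an illustration of the prose, not the library's implementation: the names `ConvMethod`, `Conv2dQuery`, `select_cpu_method` and `small_ifm_limit` are hypothetical, the cut-off for a "small" IFM is a placeholder because no value is given here, and the flags `large_shapes` and `indirect_gemm_ok` stand in for checks whose exact conditions are not described in this page. The Winograd kernel sizes are the examples listed above and may not be exhaustive.

```cpp
#include <cstddef>

// Hypothetical names for illustration; they do not mirror the library's types.
enum class ConvMethod { Direct, Im2ColGemm, IndirectGemm, Gemm, Winograd };

struct Conv2dQuery
{
    std::size_t kernel_w;
    std::size_t kernel_h;
    std::size_t stride_x;
    std::size_t stride_y;
    std::size_t dilation_x;
    std::size_t dilation_y;
    std::size_t ifm;              // number of input feature maps
    bool        is_float;         // true for F32 or F16
    bool        large_shapes;     // assumption: caller decides what "large" means
    bool        indirect_gemm_ok; // assumption: stands in for a support check
};

// Placeholder threshold for a "small" IFM; not a value taken from the library.
constexpr std::size_t small_ifm_limit = 16;

// Kernel sizes named in the text as examples of Winograd support.
constexpr bool winograd_kernel_supported(std::size_t w, std::size_t h)
{
    return (w == 3 && h == 3) || (w == 3 && h == 1) || (w == 1 && h == 3) ||
           (w == 5 && h == 1) || (w == 1 && h == 5) || (w == 5 && h == 5);
}

inline ConvMethod select_cpu_method(const Conv2dQuery &q)
{
    // Less frequent cases are checked first.
    if(q.dilation_x != 1 || q.dilation_y != 1)
    {
        return ConvMethod::Im2ColGemm;
    }
    if(q.large_shapes)
    {
        return ConvMethod::Direct; // no additional temporary memory needed
    }
    if(q.ifm < small_ifm_limit)
    {
        return ConvMethod::Gemm;
    }

    // More frequent cases.
    if(q.kernel_w == 1 && q.kernel_h == 1)
    {
        return ConvMethod::Gemm; // 1x1 conv2d is a matrix multiplication
    }
    if(q.is_float && q.stride_x == 1 && q.stride_y == 1 &&
       winograd_kernel_supported(q.kernel_w, q.kernel_h))
    {
        return ConvMethod::Winograd;
    }
    if(q.indirect_gemm_ok)
    {
        return ConvMethod::IndirectGemm;
    }

    // None of the preceding cases matched.
    return ConvMethod::Im2ColGemm;
}
```

With these assumptions, a 3x3 floating-point convolution with unit strides, unit dilation and 64 input feature maps would resolve to Winograd, while the same convolution with a stride of 2 would resolve to Indirect-GeMM.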

@subsection conv2d_heuristic_on_gpu Convolution 2D heuristic: Arm® Mali™-based GPUs

The conv2d heuristic for Arm® Mali™-based GPUs is inside the get_convolution_method() method in the ClConv2d function.

The algorithms used in the get_convolution_method() method are the following:
- Direct-Conv2D
- Im2Col+GeMM-based
- Indirect-GeMM
- GeMM
- Winograd

⚠ Attention: Winograd only works with floating-point data types (F32, F16)

The heuristic first checks the less frequent cases that can occur in ML workloads for edge devices. These cases are the following:
-# Non-unit dilation: We call Im2Col+GeMM
-# Large input and kernel shapes: We call Direct-Conv2D because it is the only algorithm that does not require additional temporary memory

In all the other cases, the GPU heuristic evaluates the suitability of Winograd and Direct-Conv2D/Indirect-GeMM.
In particular, Winograd is adopted when the convolution parameters (kernel size and strides) are supported by the algorithm and when the IFM is not small (for example, greater than 8).
There are several conditions for using the Direct-Conv2D algorithm, and we recommend looking at the heuristic directly.
In general, the Direct-Conv2D operator is used in almost all cases where the kernel size is not 1x1.
The Indirect-GeMM algorithm is used as an alternative to Direct-Conv2D only on Arm® Mali™-G77 GPUs.
If neither Winograd nor Direct-Conv2D can be used, we fall back to either GeMM (when the kernel size is 1x1) or the Im2Col+GeMM-based algorithm.
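
The GPU selection order can be sketched in the same spirit. This is a simplified illustration under stated assumptions, not the library's code: `ConvMethod`, `GpuTarget`, `GpuConv2dQuery` and `select_gpu_method` are hypothetical names, the many Direct-Conv2D conditions are collapsed into a single `direct_conv_preferred` flag that the caller would have to compute, and the Winograd kernel sizes are assumed to match the examples given in the CPU subsection. The IFM threshold of 8 is the example value quoted above. The sketch also assumes that on Mali™-G77 Indirect-GeMM replaces Direct-Conv2D whenever the latter would be chosen, which may be broader than the real conditions.

```cpp
#include <cstddef>

// Hypothetical names for illustration; they do not mirror the library's types.
enum class ConvMethod { Direct, Im2ColGemm, IndirectGemm, Gemm, Winograd };
enum class GpuTarget { G77, Other };

struct GpuConv2dQuery
{
    std::size_t kernel_w;
    std::size_t kernel_h;
    std::size_t stride_x;
    std::size_t stride_y;
    std::size_t dilation_x;
    std::size_t dilation_y;
    std::size_t ifm;                   // number of input feature maps
    bool        is_float;              // true for F32 or F16
    bool        large_shapes;          // assumption: caller decides what "large" means
    bool        direct_conv_preferred; // assumption: summarizes the Direct-Conv2D conditions
    GpuTarget   target;
};

// Example threshold quoted in the text: Winograd needs an IFM greater than 8.
constexpr std::size_t winograd_min_ifm = 8;

// Assumption: same example kernel sizes as listed for the CPU heuristic.
constexpr bool winograd_kernel_supported(std::size_t w, std::size_t h)
{
    return (w == 3 && h == 3) || (w == 3 && h == 1) || (w == 1 && h == 3) ||
           (w == 5 && h == 1) || (w == 1 && h == 5) || (w == 5 && h == 5);
}

inline ConvMethod select_gpu_method(const GpuConv2dQuery &q)
{
    // Less frequent cases are checked first.
    if(q.dilation_x != 1 || q.dilation_y != 1)
    {
        return ConvMethod::Im2ColGemm;
    }
    if(q.large_shapes)
    {
        return ConvMethod::Direct; // no additional temporary memory needed
    }

    // Winograd: supported parameters and an IFM that is not small.
    if(q.is_float && q.stride_x == 1 && q.stride_y == 1 &&
       winograd_kernel_supported(q.kernel_w, q.kernel_h) && q.ifm > winograd_min_ifm)
    {
        return ConvMethod::Winograd;
    }

    // Direct-Conv2D, or Indirect-GeMM as its alternative on Mali-G77.
    if(q.direct_conv_preferred)
    {
        return q.target == GpuTarget::G77 ? ConvMethod::IndirectGemm : ConvMethod::Direct;
    }

    // Fallback: GeMM for 1x1 kernels, Im2Col+GeMM otherwise.
    if(q.kernel_w == 1 && q.kernel_h == 1)
    {
        return ConvMethod::Gemm;
    }
    return ConvMethod::Im2ColGemm;
}
```

Keeping the Direct-Conv2D conditions behind one flag is a deliberate simplification: the text recommends reading the heuristic itself for those conditions, so the sketch only shows where that decision sits in the overall order.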

*/
} // namespace