Blame - docs/user_guide/conv2d_heuristic.dox - ml/ComputeLibrary

Gian Marco Iodice	78ce273	2023-08-04 15:26:41 +0100	[diff] [blame]	1	///
				2	/// Copyright (c) 2023 Arm Limited.
				3	///
				4	/// SPDX-License-Identifier: MIT
				5	///
				6	/// Permission is hereby granted, free of charge, to any person obtaining a copy
				7	/// of this software and associated documentation files (the "Software"), to
				8	/// deal in the Software without restriction, including without limitation the
				9	/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
				10	/// sell copies of the Software, and to permit persons to whom the Software is
				11	/// furnished to do so, subject to the following conditions:
				12	///
				13	/// The above copyright notice and this permission notice shall be included in all
				14	/// copies or substantial portions of the Software.
				15	///
				16	/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
				17	/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
				18	/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
				19	/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
				20	/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
				21	/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
				22	/// SOFTWARE.
				23	///
				24
				25	namespace arm_compute
				26	{
				27	/**
				28	@page conv2d_heuristic Convolution 2D heuristic
				29
				30	@section conv2d_heuristic_algorithms_used Convolution 2D heuristic: algorithm selection
				31
				32	The 2D convolution (conv2D for short) is one of the most compute-intensive and performance-critical operators in ML workloads.
				33	This operator can be implemented with different algorithms, which differ in accuracy, supported kernel sizes, and the additional memory they require.
				34	Unfortunately, no single algorithm achieves the best performance in all scenarios.
				35	Therefore, the Arm Compute Library integrates a heuristic within the conv2d operators to select the most efficient algorithm, depending on the input and kernel shapes and the desired level of accuracy.
				36	The heuristic depends on the target backend (either NEON™ for Arm® CPUs or OpenCL for Arm® GPUs), and the following subsections describe the main details behind the algorithm selection.
				37
				38	⚠ Attention: The heuristics presented in the following subsections will only refer to the NHWC data layout, which is the optimal and recommended layout for the Arm Compute Library.
				39
				40	@subsection conv2d_heuristic_on_cpu Convolution 2D heuristic: Arm® Cortex®-based CPUs
				41
				42	The conv2d heuristic for Arm® Cortex®-based CPUs is inside the get_convolution_method() method in the CpuConv2d function.
				43	The algorithms used in the get_convolution_method() function are the following:
				44	- Direct-Conv2D
				45	- Im2Col+GeMM-based
				46	- Indirect-GeMM (a.k.a. GEMMCONV2D)
				47	- GeMM
				48	- Winograd
				49
				50	⚠ Attention: Winograd only works with floating-point data types (F32, F16)
				51
				52	The heuristic first checks the less frequent cases that may occur in ML workloads for edge devices. These cases are the following:
				53	-# Non-unit dilation: We call Im2Col+GeMM
				54	-# Large input and kernel shapes: We call Direct-Conv2D because it is the only algorithm that does not require additional temporary memory
				55	-# Small Input-Feature-Maps (IFM): In this scenario, we have found that the GeMM implementation is generally more efficient than Winograd and Indirect-GeMM
				56
				57	For the more frequent cases, such as unit dilation or a larger IFM, we evaluate the following conditions instead:
				58	-# Unit kernel size (1x1): In this scenario, the conv2d operation corresponds to a matrix multiplication and we call GeMM.
				59	-# Winograd: It only works with unit strides and supports a limited number of kernel sizes, such as 3x3, 3x1, 1x3, 5x1, 1x5 and 5x5
				60	-# Indirect-GeMM: It should be used in all cases except when the kernel size is 1x1 or when the IFM is small
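
The first condition in the list above relies on a 1x1 convolution being a matrix multiplication. The following minimal sketch illustrates this for the NHWC layout, assuming a single batch, unit strides and no padding. The function names conv1x1_nhwc and matmul_bt and the weight layout [C_out][C_in] are hypothetical choices made for this illustration; they are not part of the Arm Compute Library API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference 1x1 convolution over an NHWC tensor (batch = 1, unit strides, no padding).
// src is laid out as [H][W][C_in], weights as [C_out][C_in] (assumed layout), dst as [H][W][C_out].
inline std::vector<float> conv1x1_nhwc(const std::vector<float> &src, const std::vector<float> &weights,
                                       std::size_t h, std::size_t w, std::size_t c_in, std::size_t c_out)
{
    assert(src.size() == h * w * c_in);
    assert(weights.size() == c_out * c_in);
    std::vector<float> dst(h * w * c_out, 0.0f);
    for(std::size_t y = 0; y < h; ++y)
    {
        for(std::size_t x = 0; x < w; ++x)
        {
            for(std::size_t co = 0; co < c_out; ++co)
            {
                float acc = 0.0f;
                for(std::size_t ci = 0; ci < c_in; ++ci)
                {
                    acc += src[(y * w + x) * c_in + ci] * weights[co * c_in + ci];
                }
                dst[(y * w + x) * c_out + co] = acc;
            }
        }
    }
    return dst;
}

// Plain matrix multiplication C = A * B^T, with A of size [M][K] and B of size [N][K].
inline std::vector<float> matmul_bt(const std::vector<float> &a, const std::vector<float> &b,
                                    std::size_t m, std::size_t k, std::size_t n)
{
    assert(a.size() == m * k);
    assert(b.size() == n * k);
    std::vector<float> c(m * n, 0.0f);
    for(std::size_t i = 0; i < m; ++i)
    {
        for(std::size_t j = 0; j < n; ++j)
        {
            float acc = 0.0f;
            for(std::size_t p = 0; p < k; ++p)
            {
                acc += a[i * k + p] * b[j * k + p];
            }
            c[i * n + j] = acc;
        }
    }
    return c;
}
```

Because the NHWC layout stores the channels of each pixel contiguously, the source tensor can be read directly as a matrix with H*W rows and C_in columns. Calling matmul_bt(src, weights, h * w, c_in, c_out) therefore yields the same values as conv1x1_nhwc(src, weights, h, w, c_in, c_out) without any data rearrangement.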
				61
				62	If none of the preceding cases are met, we fall back to the Im2Col+GeMM-based algorithm.
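
The order of the checks described in this subsection can be sketched as follows. This is an illustration under stated assumptions and not the code of get_convolution_method(): the types CpuConv2dAlgo and CpuConv2dDesc, the function select_cpu_conv2d() and the boolean flags large_shapes, small_ifm and indirect_gemm_usable are hypothetical. The flags stand in for the real shape checks, whose thresholds are not given here.

```cpp
#include <cassert>

// Hypothetical types for illustration only; they are not Arm Compute Library types.
enum class CpuConv2dAlgo { Direct, Im2ColGemm, IndirectGemm, Gemm, Winograd };

struct CpuConv2dDesc
{
    unsigned kernel_w = 3, kernel_h = 3;
    unsigned stride_x = 1, stride_y = 1;
    unsigned dilation_x = 1, dilation_y = 1;
    bool is_float = true;             // F32 or F16
    bool large_shapes = false;        // assumed flag: "large input and kernel shapes"
    bool small_ifm = false;           // assumed flag: "small IFM"
    bool indirect_gemm_usable = true; // assumed flag: Indirect-GeMM supports this configuration
};

// Kernel sizes that the text lists as examples of Winograd support.
inline bool winograd_kernel_listed(unsigned w, unsigned h)
{
    return (w == 3 && h == 3) || (w == 3 && h == 1) || (w == 1 && h == 3) ||
           (w == 5 && h == 1) || (w == 1 && h == 5) || (w == 5 && h == 5);
}

inline CpuConv2dAlgo select_cpu_conv2d(const CpuConv2dDesc &d)
{
    assert(d.kernel_w > 0 && d.kernel_h > 0);

    // Less frequent cases, checked first.
    if(d.dilation_x != 1 || d.dilation_y != 1)
    {
        return CpuConv2dAlgo::Im2ColGemm;
    }
    if(d.large_shapes)
    {
        return CpuConv2dAlgo::Direct; // no additional temporary memory
    }
    if(d.small_ifm)
    {
        return CpuConv2dAlgo::Gemm;
    }

    // More frequent cases.
    if(d.kernel_w == 1 && d.kernel_h == 1)
    {
        return CpuConv2dAlgo::Gemm; // 1x1 conv2d is a matrix multiplication
    }
    const bool unit_stride = (d.stride_x == 1 && d.stride_y == 1);
    if(d.is_float && unit_stride && winograd_kernel_listed(d.kernel_w, d.kernel_h))
    {
        return CpuConv2dAlgo::Winograd;
    }
    if(d.indirect_gemm_usable)
    {
        return CpuConv2dAlgo::IndirectGemm;
    }

    // Fallback when none of the preceding cases apply.
    return CpuConv2dAlgo::Im2ColGemm;
}
```

With the default descriptor (3x3 kernel, unit strides and dilation, floating-point data), the sketch returns Winograd. Setting both strides to 2 moves the selection to Indirect-GeMM, and any non-unit dilation selects Im2Col+GeMM regardless of the other fields.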
				63
				64	@subsection conv2d_heuristic_on_gpu Convolution 2D heuristic: Arm® Mali™-based GPUs
				65
				66	The conv2d heuristic for Arm® Mali™-based GPUs is inside the get_convolution_method() method in the ClConv2d function.
				67
				68	The algorithms used in the get_convolution_method() function are the following:
				69	- Direct-Conv2D
				70	- Im2Col+GeMM-based
				71	- Indirect-GeMM
				72	- GeMM
				73	- Winograd
				74
				75	⚠ Attention: Winograd only works with floating-point data types (F32, F16)
				76
				77	The heuristic first checks the less frequent cases that may occur in ML workloads for edge devices. These cases are the following:
				78	-# Non-unit dilation: We call Im2Col+GeMM
				79	-# Large input and kernel shapes: We call Direct-Conv2D because it is the only algorithm that does not require additional temporary memory
				80
				81	In all other cases, the GPU heuristic evaluates the suitability of Winograd and Direct-Conv2D/Indirect-GeMM.
				82	In particular, Winograd is adopted when the convolution parameters (kernel size and strides) are supported by the algorithm and when the IFM is not small (for example, greater than 8).
				83	There are several conditions for using the Direct-Conv2D algorithm, and we recommend looking at the heuristic directly.
				84	In general, the Direct-Conv2D operator is used in almost all cases where the kernel size is not 1x1.
				85	The Indirect-GeMM algorithm is used as an alternative to Direct-Conv2D only on the Arm® Mali™-G77 GPU.
				86	If neither Winograd nor Direct-Conv2D can be used, we fall back to either GeMM (when the kernel size is 1x1) or the Im2Col+GeMM-based algorithm.
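
The GPU selection described above can be sketched in the same way. Again, this is an illustration under stated assumptions and not the code of get_convolution_method(): the types GpuConv2dAlgo and GpuConv2dDesc, the function select_gpu_conv2d() and the flags large_shapes, winograd_params_supported, direct_conditions_met and is_mali_g77 are hypothetical. The IFM threshold of 8 is the example value quoted in the text. Evaluating Winograd before Direct-Conv2D is an assumption of this sketch.

```cpp
#include <cassert>

// Hypothetical types for illustration only; they are not Arm Compute Library types.
enum class GpuConv2dAlgo { Direct, Im2ColGemm, IndirectGemm, Gemm, Winograd };

struct GpuConv2dDesc
{
    unsigned ifm = 64;                     // number of input feature maps
    unsigned kernel_w = 3, kernel_h = 3;
    unsigned dilation_x = 1, dilation_y = 1;
    bool is_float = true;                  // F32 or F16
    bool large_shapes = false;             // assumed flag: "large input and kernel shapes"
    bool winograd_params_supported = true; // assumed flag: kernel size and strides fit Winograd
    bool direct_conditions_met = true;     // assumed flag: stands in for the real Direct-Conv2D checks
    bool is_mali_g77 = false;              // assumed flag: target is a Mali-G77 GPU
};

// Example value from the text: Winograd is adopted when the IFM is greater than 8.
constexpr unsigned winograd_ifm_example_threshold = 8;

inline GpuConv2dAlgo select_gpu_conv2d(const GpuConv2dDesc &d)
{
    assert(d.kernel_w > 0 && d.kernel_h > 0);

    // Less frequent cases, checked first.
    if(d.dilation_x != 1 || d.dilation_y != 1)
    {
        return GpuConv2dAlgo::Im2ColGemm;
    }
    if(d.large_shapes)
    {
        return GpuConv2dAlgo::Direct; // no additional temporary memory
    }

    // Winograd: supported parameters, floating-point data and an IFM that is not small.
    if(d.is_float && d.winograd_params_supported && d.ifm > winograd_ifm_example_threshold)
    {
        return GpuConv2dAlgo::Winograd;
    }

    // Direct-Conv2D, or Indirect-GeMM as its alternative on Mali-G77, for non-1x1 kernels.
    const bool is_1x1 = (d.kernel_w == 1 && d.kernel_h == 1);
    if(!is_1x1 && d.direct_conditions_met)
    {
        return d.is_mali_g77 ? GpuConv2dAlgo::IndirectGemm : GpuConv2dAlgo::Direct;
    }

    // Fallback: GeMM for 1x1 kernels, Im2Col+GeMM otherwise.
    return is_1x1 ? GpuConv2dAlgo::Gemm : GpuConv2dAlgo::Im2ColGemm;
}
```

For the real conditions behind the direct_conditions_met flag, the heuristic in ClConv2d remains the reference.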
				87
				88	*/
				89	} // namespace