| // |
| // This confidential and proprietary software may be used only as |
| // authorised by a licensing agreement from ARM Limited |
| // (C) COPYRIGHT 2020-2024 ARM Limited |
| // ALL RIGHTS RESERVED |
| // The entire notice above must be reproduced on all authorised |
| // copies and copies may only be made to the extent permitted |
| // by a licensing agreement from ARM Limited. |
| |
| == Introduction |
| |
| === Overview |
| |
| Tensor Operator Set Architecture (TOSA) provides a set of whole-tensor |
| operations commonly employed by Deep Neural Networks. The intent is to enable a |
| variety of implementations running on a diverse range of processors, with the |
| results at the TOSA level consistent across those implementations. Applications |
| or frameworks which target TOSA can therefore be deployed on a wide range of |
| different processors, such as SIMD CPUs, GPUs and custom hardware such as |
| NPUs/TPUs, with defined accuracy and compatibility constraints. Most operators |
| from the common ML frameworks (TensorFlow, PyTorch, etc.) should be expressible |
| in TOSA. It is expected that there will be tools to lower from ML frameworks |
| into TOSA. |
| |
| === Goals |
| |
| The goals of TOSA include the following: |
| |
| * A minimal and stable set of tensor-level operators to which machine learning |
| framework operators can be reduced. |
| |
| * Full support for both quantized integer and floating-point content. |
| |
* Precise functional description of the behavior of every operator, including
their numerical behavior with respect to precision, saturation, scaling, and
range as required by quantized datatypes.
| |
| * Agnostic to any single high-level framework, compiler backend stack or |
| particular target. |
| |
| * The detailed functional and numerical description enables precise code |
| construction for a diverse range of targets – SIMD CPUs, GPUs and custom |
| hardware such as NPUs/TPUs. |
| |
| === Specification |
| |
| The TOSA Specification is written as AsciiDoc mark-up and developed in its raw |
| mark-up form, managed through a git repository here: |
| https://git.mlplatform.org/tosa/specification.git/. |
| The specification is developed and versioned much like software. |
| While the mark-up is legible and can be read fairly easily in its raw form, it is recommended to build or “render” the mark-up into PDF or HTML. |
| To do this, please follow the instructions in the README.md in the root of the specification repository. |
| |
| === Operator Selection Principles |
| |
| TOSA defines a set of primitive operators to which higher level operators can be lowered in a consistent way. |
| To remain effective and efficient to implement, the set of operators must be constrained to a reasonably small set of primitive operations out of which others can be constructed. |
| The following principles govern the selection of operators within TOSA. |
| |
| .Principles |
| [cols="1,5,5"] |
| |=== |
| |ID|Principle|Reason for this |
| |
| |P0 |
| |An operator shall be a primitive operation or building block that cannot be decomposed into simpler whole tensor operations. |
| |If the operator can be broken down, then we should look at the component operators. |
| |
| |P1 |
|An operator shall be usable as a component out of which more complex operations can be constructed.
|Single-use operators have a high architectural cost, and a more reusable version should be considered instead.
| |
| |P2 |
| |Precision should be appropriate for the input and output data types. |
| |Precision higher than that needed to calculate the result leads to extra implementation cost. |
| |
| |P3 |
| |Numerical definition of common sub-operations should be consistent between operators (for example: value scaling). |
| |Consistent sub-operation definition reduces the operator implementation cost. |
| |
| |P4 |
| |The valid input and output ranges for all arguments shall be specified. |
| |Ranges are required to make consistent (numerically agreeing) implementations possible. |
| |
| |P5 |
| |Integer operators shall be implementable in a bit-exact form with good efficiency on CPU, GPU and hardware targets. |
| |Reduces implementation cost and gives consistent inference results. |
| |=== |
| |
| === Profiles |
| |
| TOSA supports three profiles that enable efficient implementation on different classes of device. |
| The Base Inference profile is intended for embedded integer/fixed-point designs performing inference only. |
| The Main Inference profile is intended for general inference functionality including integer and floating-point data types. |
| The Main Training profile adds training operators in addition to inference operators. |
| This version of the specification covers the Base Inference and Main Inference profiles. |
The Main Training profile is expected in a later version of the specification.
| The following table summarizes the three profiles: |
| |
| .Profiles |
| |=== |
| |Profile|Name|Integer Inference|Floating-point Inference|Training |
| |
| |Base Inference|TOSA-BI|Yes|No|No |
| |Main Inference|TOSA-MI|Yes|Yes|No |
| |Main Training|TOSA-MT|Yes|Yes|Yes |
| |=== |
| |
| === Levels |
| |
| A TOSA level defines operator argument ranges that an implementation shall support. |
| This is distinct from a profile that defines the operations and data-types supported. |
| This version of the specification defines two TOSA levels: |
| |
* No level: allows the full range of arguments specified by the operations according to the operation data types.
* Level 8K: ranges are expected to be sufficient for applications with frame sizes up to 8K.
| |
| Later versions of the specification may define additional levels. |
The following table defines the value ranges for Level 8K.
These ranges are checked using the LEVEL_CHECK() function within the operator descriptions.
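As an illustrative sketch only (the member name MAX_KERNEL and the value 8192 are assumptions to be checked against the generated level table below), a level check in an operator description might take the following form:

[source,c++]
----
#include <cassert>
#include <cstdint>

// Hypothetical level structure; MAX_KERNEL is an assumed member name and
// 8192 an assumed Level 8K value - check both against the level table
struct tosa_level_t { int32_t MAX_KERNEL; };

// Sketch: report whether an argument range condition holds for the level
bool LEVEL_CHECK(bool condition) { return condition; }

int main() {
    tosa_level_t level_8k = {8192};
    int32_t kernel_y = 1024;
    // An operator description checks its argument ranges against the level
    assert(LEVEL_CHECK(kernel_y <= level_8k.MAX_KERNEL));
    return 0;
}
----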
| |
| .Level maximums |
| include::{generated}/levels.adoc[] |
| |
| === Status |
| |
| The TOSA specification is a work in progress. |
| |
| * The Base Inference profile should be considered to be near release quality, with conformance tests available. |
| * The Main Inference profile has most of the expected operators in place, but is still subject to change. |
| * The reference model and conformance tests do not yet support all of the floating point types that have been defined. |
| * There is not currently a conformance test suite available for Main Inference. |
* The Main Training profile is pre-alpha; significant work remains on the profile, and no conformance tests are available.
| |
| === Compliance |
| |
| This section defines when a TOSA implementation is compliant to a given TOSA specification profile and level. |
| To be compliant an implementation must achieve the results and accuracy defined by this specification. |
| TOSA also defines a set of conformance tests. |
| A compliant implementation must pass the conformance tests. |
| The conformance tests are not exhaustive, so an implementation that passes the conformance tests may not be compliant if there is a non-compliance that is undetected by the tests. |
| |
| ==== Base Inference Profile Compliance |
| |
The <<Operator Graphs>> section of this specification defines a TOSA graph and its behavior.
| This behavior is captured in the pseudo-code function tosa_execute_graph(). |
| For a given input graph (with attributes) and input tensors there are three possible tosa_graph_result values after executing the graph: |
| |
| * tosa_unpredictable: The result of the graph on the given inputs cannot be relied upon. |
| * tosa_error: The graph does not meet the specification and is recognised as an illegal graph. |
| * tosa_valid: The result is defined and predictable and the list of output tensors defines the result. |
| |
An implementation is compliant to the TOSA Base Inference Profile if it matches the above results as follows:
| |
| * For tosa_unpredictable, the implementation can return whatever result it chooses (including error) |
| * For tosa_error, the implementation must return an error result (and there is no requirement on how much of the graph is executed, if any) |
| * For tosa_valid, the implementation must execute the entire graph without error and return the result defined by this specification. |
| |
In terms of pseudo-code, if *graph* is a TOSA graph consisting of Base Inference Profile operators and *input_list* is a list of input tensors then the following test must pass.
| |
| [source,c++] |
| ---- |
| bool tosa_test_compliance(tosa_graph_t graph, tosa_list_t input_list, tosa_level_t level) { |
| shape_list_t output_list_spec = tosa_allocate_list(tosa_output_shape(graph)); |
| shape_list_t output_list_test = tosa_allocate_list(tosa_output_shape(graph)); |
| tosa_graph_result = tosa_valid; // result starts as valid |
| tosa_nesting_depth = 0; // if/while nesting level |
| tosa_execute_graph(graph, input_list, output_list_spec, level); |
| if (tosa_graph_result == tosa_unpredictable) { |
| return true; // No requirement to match an unpredictable result |
| } |
| result_test = execute_implementation_under_test(graph, input_list, output_list_test); |
| if (tosa_graph_result == tosa_error) { |
| return result_test == tosa_error; // result must be an error |
| } |
| if (exact_tensor_match(output_list_spec, output_list_test)) { |
| // Predictable bit-exact value match required |
| return true; |
| } |
| return false; |
| } |
| ---- |
| |
| ==== Main Inference Profile Compliance |
| |
| A Main Inference compliant implementation must satisfy the following: |
| |
* The implementation must meet <<Base Inference Profile Compliance>> for all Base Inference compliant graphs
| * The implementation must support all Main Inference operations using the datatype fp32_t |
| ** The operations must meet the precision requirements of <<Main Inference precision requirements>> |
| * The implementation must support all Main Inference operations using the datatype fp16_t |
| ** The operations must meet the precision requirements of <<Main Inference precision requirements>> |
| ** Note: These requirements allow fp16_t operations to be implemented using the fp32_t datatype |
| * The implementation must support all Main Inference operations using the datatype bf16_t |
| ** The operations must meet the precision requirements of <<Main Inference precision requirements>> |
| ** Note: These requirements allow bf16_t operations to be implemented using the fp32_t datatype |
| |
As with <<Base Inference Profile Compliance>>, the pseudo-code function tosa_execute_graph() can return one of three possible results.
| A compliant implementation must satisfy the following: |
| |
| * For a graph returning tosa_error the implementation must also return an error |
| * For a graph returning tosa_valid the implementation must execute the entire graph without error |
| * For a graph returning tosa_valid and consisting only of integer operators the results must match exactly |
| |
| ===== Main Inference precision requirements |
| |
In a compliant implementation, individual floating-point operations within the graph must meet the accuracy bounds listed in the following table.
| In the table _ulp_ means unit of the last place. |
| The function tosa_reference_check_fp() defines the error range permitted by a given number of units of last place in this specification. |
| |
| NOTE: The error criteria in this section are at an early draft stage and are likely to change during conformance test development. |
| |
| The following criteria apply to all operations: |
| |
| * If any input is a NaN and the result is floating-point then the result must be a NaN |
| * If any input is a NaN and the operation is a comparison (greater, greater-equal, equal) then the result must be false |
* If any input is a NaN and the operation is conversion to an integer or boolean then the result is unpredictable
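As a minimal illustration of the _ulp_ measure (this is not the specification's tosa_reference_check_fp() definition), the size of one ulp of an fp32_t value can be computed as the gap to the next representable value:

[source,c++]
----
#include <cassert>
#include <cmath>
#include <limits>

// One ulp of x: gap between x and the next representable float above it
float ulp(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

int main() {
    // For fp32_t values in [1.0, 2.0), one ulp is 2^-23
    assert(ulp(1.0f) == std::ldexp(1.0f, -23));
    // The ulp scales with the exponent: one ulp of 2.0 is 2^-22
    assert(ulp(2.0f) == std::ldexp(1.0f, -22));
    return 0;
}
----

An accuracy bound of, say, 0.5 ulp therefore permits an absolute error of at most half this gap at the magnitude of the result.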
| |
| [cols="1,3"] |
| |=== |
| | Operation | Accuracy bound |
| |
| | <<ARGMAX>>, <<MAX_POOL2D>>, <<CLAMP>>, <<MAXIMUM>>, <<MINIMUM>>, <<ABS>>, <<NEGATE>>, <<SELECT>>, <<REDUCE_MAX>>, <<REDUCE_MIN>>, <<CONST>>, <<IDENTITY>> |
| | Non NaN results must be exact. |
| |
| | <<EQUAL>>, <<GREATER>>, <<GREATER_EQUAL>> |
| | The result must be exact with: + |
| (1) The sign of the zero is ignored + |
| (2) Infinities of the same sign compare as equal |
| |
| | <<CONV2D>>, <<CONV3D>>, <<DEPTHWISE_CONV2D>>, <<FULLY_CONNECTED>>, <<MATMUL>>, <<TRANSPOSE_CONV2D>> |
| | Each output can be expressed as a dot product of two input vectors. + |
| The dot product must meet the <<Dot product accuracy requirements>> |
| |
| | <<FFT2D>>, <<RFFT2D>> |
| | Each output can be expressed as a dot product of an input vector with a constant coefficient vector. + |
| The dot product must meet the <<Dot product accuracy requirements>> |
| |
| | <<ADD>>, <<MUL>>, <<SUB>>, <<CEIL>>, <<FLOOR>> |
| | Floating-point result overflows must be set to infinity of the correct sign. + |
| Floating-point result underflows must be set to zero of the correct sign. + |
Addition of infinities of different signs must produce a NaN. +
| Subtraction of infinities of the same sign must produce a NaN. + |
| Multiplication of an infinity by a zero must produce a NaN. + |
| Otherwise the result must be within 0.5 ulp of the mathematical result. |
| |
| | <<CAST>> |
| | Floating-point result overflows must be set to infinity of the correct sign. + |
| Floating-point result underflows must be set to zero of the correct sign. + |
| Cast from floating-point to integer result overflows must be saturated. + |
| Cast from floating-point to integer must be rounded using round to nearest, ties to even, rounding mode. + |
| Otherwise cast to floating-point must be within 0.5 ulp of the mathematical result. |
| |
| | <<RECIPROCAL>> |
| If the input is a zero or the result overflows the output must be an infinity of the same sign. +
If the input is an infinity or the result underflows the output must be a zero of the same sign. +
Otherwise the result must be within 1 ulp of the mathematical result.
| |
| | <<RSQRT>> |
| | If the input is less than zero the result must be a NaN. + |
| Otherwise if the input is a zero the output must be an infinity of the same sign. + |
| Otherwise the result must be within 2 ulp of the mathematical result. |
| |
| | <<LOG>>, <<ERF>> |
| | If the input to LOG is less than zero then the result must be a NaN. + |
| If the result overflows the output must be an infinity of the correct sign. + |
| If the result underflows the output must be a zero of the correct sign. + |
| Otherwise the result must be within 5 ulp of the mathematical result. |
| |
| | <<EXP>> |
| | Let `x` be an input element and `out_imp` the implementation output of `exp(x)`. + |
| Let `out_ref` be the result of the fp64_t reference implementation of `exp(x)`. + |
| Let `err_bnd = abs(out_ref) * exp2(-normal_frac<in_out_t>) * (1+abs(x))` + |
| Then `tosa_reference_check_fp_bnd<in_out_t>(out_imp, out_ref, err_bnd)` must be true |
| |
| | <<POW>> |
| | Let `x`, `y` be input elements from `input1` and `input2` respectively. + |
| Let `out_imp` be the implementation output of `pow(x,y)`. + |
| If `x` is less than zero and `y` is non-integral then the result must be a NaN. + |
| Let `out_ref` be the result of the fp64_t reference implementation of `pow(x,y)`. + |
| Let `err_bnd = abs(out_ref) * exp2(-normal_frac<in_out_t>) * (1+abs(log(abs(x))*y))` + |
| Then `tosa_reference_check_fp_bnd<in_out_t>(out_imp, out_ref, err_bnd)` must be true |
| |
| | <<SIGMOID>> |
| | Let `x` be an input element and `out_imp` the implementation output. + |
| Let `out_ref` be the result of the fp64_t reference implementation. + |
| Let `err_bnd = abs(out_ref) * exp2(-normal_frac<in_out_t>) * (2 * (1+abs(x)))` + |
| Then `tosa_reference_check_fp_bnd<in_out_t>(out_imp, out_ref, err_bnd)` must be true |
| |
| | <<TANH>> |
| | Let `x` be an input element and `out_imp` the implementation output. + |
| Let `out_ref` be the result of the fp64_t reference implementation. + |
| Let `err_bnd = exp2(-normal_frac<in_out_t>) * max(0.5, abs(out_ref) * (4 * (1+abs(x))))` + |
| Then `tosa_reference_check_fp_bnd<in_out_t>(out_imp, out_ref, err_bnd)` must be true |
| |
| | <<REDUCE_SUM>> |
| | Each output can be expressed as a dot product of an input vector with a vector of ones. + |
| This dot product must meet the <<Dot product accuracy requirements>> |
| |
| | <<AVG_POOL2D>> |
| | Each output can be expressed as a dot product of an input vector with a vector with elements 1/KS where KS is the kernel size. + |
| This dot product must meet the <<Dot product accuracy requirements>> |
| |
| | <<REDUCE_PRODUCT>> |
| | Result overflows must be set to an infinity of the correct sign. + |
| Result underflows must be set to a zero of the correct sign. + |
Let `n` be the number of elements in the product, `out_imp` the implementation result, and `out_ref` the result of the fp64_t reference implementation. +
| Let `err_bnd = abs(out_ref) * (pow(1 + pow(2, -normal_frac<in_out_t> - 1), n) - 1)` + |
| Then `tosa_reference_check_fp_bnd<in_out_t>(out_imp, out_ref, err_bnd)` must be true |
| |
| |=== |
| |
| ===== Operator sequence precision requirement |
| |
| Precision criteria are specified for a single operator. |
| |
An implementation M of a sequence of n TOSA operators, A[0] to A[n-1], is said
to be compliant if M gives the same result as a sequence of implementations
M[0] to M[n-1] such that:
| |
* Each M[k] implements A[k] with the same or higher precision datatypes
| * Each M[k] meets the accuracy defined in this specification for A[k] where the M[k] output is converted to A[k] output precision using round to nearest |
| |
| ===== Dot product accuracy requirements |
| |
| This section assumes an operation acting on tensors named 'input', 'weight' and optionally 'bias'. |
| Each output tensor element can be expressed as a dot product of elements between the 'input' and 'weight' tensors with optional bias addition. |
| The dot product has length KS, the kernel size. |
| If the operation does not specify a bias then 'bias' is taken to be zero in this section. |
| Note: KS is defined for each relevant operator in the appendix section <<Main Inference operator test data>>. |
| |
| In other words, each output element `out` can be expressed as a dot product between input elements `in[k]`, weight elements `w[k]`, bias `b`: |
| |
| `out = in[0] * w[0] + in[1] * w[1] + ... + in[KS-1] * w[KS-1] + b` |
| |
The positions of `in[k]`, `w[k]`, `b` in the input, weight and bias tensors depend on the operation being performed.
| This may be, for example, a convolution. |
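As a concrete illustration of the expression above (the function and variable names here are illustrative, not part of the specification), one output element of a length-KS dot product with bias can be computed as:

[source,c++]
----
#include <cassert>
#include <cstddef>

// One output element as a dot product of input elements in[k] and weight
// elements w[k] over the kernel size KS, plus the bias b
float dot_product_output(const float* in, const float* w, size_t KS, float b) {
    float out = b;
    for (size_t k = 0; k < KS; ++k) {
        out += in[k] * w[k];
    }
    return out;
}

int main() {
    const float in[3] = {1.0f, 2.0f, 3.0f};
    const float w[3]  = {4.0f, 5.0f, 6.0f};
    // 1*4 + 2*5 + 3*6 + 1 = 33
    assert(dot_product_output(in, w, 3, 1.0f) == 33.0f);
    return 0;
}
----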
| |
| This section defines the accuracy required for these operations. |
| In this section: |
| |
| * "fp64 arithmetic" refers to double-precision floating-point arithmetic defined by IEEE 754 (<<Other publications>>[1]) |
| * `operation_fp64()` is an fp64 reference implementation of the operation |
| * `operation_imp()` is the implementation under test |
| * `local_bound` is defined as follows: |
| ** For operations with a local_bound attribute it is the value of the optional attribute, with default value of false |
| ** For operations that do not have a local_bound attribute the value is true |
| |
| The checks described in the following code must pass for the following data sets: |
| |
| * Data sets defined for the operation in Appendix A <<Main Inference operator test data>>. |
| * Data sets that have at least MIN_DOT_PRODUCT different output values. For these data sets we take S=-1. |
| |
| [source,c++] |
| ---- |
| output_ref = operation_fp64(input, weight, bias); |
| output_imp = operation_imp (input, weight, bias); |
| input_abs = abs(input); // Element-wise absolute |
| weight_abs = abs(weight); // Element-wise absolute |
| bias_abs = abs(bias); // Element-wise absolute |
| if (!local_bound) { |
| input_abs_max = max_value(input_abs); // maximum over all elements |
    for_each(index in shape(input_abs)) {
| input_abs[index] = input_abs_max; // set all entries to global maximum |
| } |
| } |
| output_bnd = operation_fp64(input_abs, weight_abs, bias_abs); |
| |
size_t T = tensor_size(output_shape);  // number of dot product results
| size_t ksb = (max_value(bias_abs) > 0) ? (KS + 1) : KS; // kernel size and bias |
| fp64_t out_err_sum = 0.0; |
| fp64_t out_err_sumsq = 0.0; |
| for_each(index in output_shape) { |
| fp64_t out_bnd = tensor_read<fp64_t>(output_bnd, output_shape, index); |
| fp64_t out_ref = tensor_read<fp64_t>(output_ref, output_shape, index); |
| acc_t out_imp = tensor_read<acc_t> (output_imp, output_shape, index); |
| fp64_t out_err; |
| if ((acc_t)out_bnd == infinity) { |
| // dot product can overflow and there is no accuracy limit |
| out_err = 0.0; |
| } else if (out_bnd == 0.0) { |
| REQUIRE(out_ref == 0.0 && out_imp == 0.0); |
| out_err = 0.0; |
| } else { // 0.0 < out_bnd < infinity |
| fp64_t out_err_bnd = max(out_bnd * exp2(-1-normal_frac<acc_t>()), normal_min<acc_t>()); |
| out_err = (static_cast<fp64_t>(out_imp) - out_ref) / out_err_bnd; |
| REQUIRE(abs(out_err) <= ksb); |
| } |
| out_err_sum += out_err; |
| out_err_sumsq += out_err * out_err; |
| } |
| if (input and weights are data set S with 3 <= S <= 5) { |
| // check output error bias magnitude for data sets S which are not positive biased |
| REQUIRE(abs(out_err_sum) <= 2*sqrt(ksb*T)); |
| } |
| // check output error variance magnitude |
REQUIRE(out_err_sumsq <= 0.4*ksb*T);
| ---- |
| |
| === Tensor Definitions |
| |
| ==== Tensors |
| |
| Tensors are multidimensional arrays of data. |
| Tensors have metadata associated with them that describe characteristics of the tensor, including: |
| |
| * Data Type |
| * Shape |
| |
| The number of dimensions in a shape is called the rank. |
| A tensor with rank equal to zero is permitted. |
| In that case, the tensor has a single entry and is also known as a scalar. |
| A tensor shape is an array of integers of size equal to the rank of the tensor. |
| Each element in the tensor shape describes the number of elements in the dimension. |
| The tensor shape in each dimension must be greater than or equal to 1. |
| For tensor access information, see <<Tensor Access Helpers>>. |
| |
| The shape of a tensor of non-zero rank is a special type shape_t. |
| shape_t is a one-dimensional list with the size equal to the rank of the original tensor. |
| The components of a shape_t are of type size_t. |
| |
| In this version of the specification, shape_t values must be resolvable to constants at backend compile time. |
| |
| ==== Tensor size limit |
| |
| The tensor overall size is limited by the data type size_t. |
| This type must be able to hold integers in the range 0 to (1 << (MAX_LOG2_SIZE + 1)) - 1 where MAX_LOG2_SIZE is defined in <<Levels>>. |
For each tensor, the number of tensor elements multiplied by the element size in bytes (which is taken to be 1 for elements smaller than 8 bits) must be less than or equal to (1 << (MAX_LOG2_SIZE + 1)) - 1.
| |
| The size of tensors along each of their dimensions is limited by the data type size_t. |
| |
| This means that the maximum size of a tensor along each dimension is (1 << MAX_LOG2_SIZE) - 1 and therefore the maximum coordinate value is (1 << MAX_LOG2_SIZE) - 2. |
| Indices used to access tensors must be non-negative. |
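To make these limits concrete (the value of MAX_LOG2_SIZE below is a placeholder for illustration only; the actual value is defined per level in <<Levels>>):

[source,c++]
----
#include <cassert>
#include <cstdint>

int main() {
    // Placeholder only - the real MAX_LOG2_SIZE comes from the level table
    const uint64_t MAX_LOG2_SIZE = 31;
    const uint64_t max_total_bytes = (uint64_t(1) << (MAX_LOG2_SIZE + 1)) - 1;
    const uint64_t max_dim_size    = (uint64_t(1) << MAX_LOG2_SIZE) - 1;
    // The maximum coordinate is one less than the maximum dimension size
    assert(max_dim_size - 1 == (uint64_t(1) << MAX_LOG2_SIZE) - 2);
    // Example: 2^20 fp32 elements occupy 2^22 bytes, within the overall limit
    assert(uint64_t(1 << 20) * 4 <= max_total_bytes);
    return 0;
}
----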
| |
| |
| ==== Data Layouts |
| |
| The following data layouts are supported in TOSA. |
| TOSA operations are defined in terms of a linear packed tensor layout. |
In a linear packed layout, a rank r tensor stores the elements of dimension (r-1) consecutively.
The next dimension to increment is dimension (r-2), and so on.
| For a specification of this layout see the tensor read and write functions in section <<Tensor Access Helpers>>. |
| |
| An implementation of TOSA can choose a different tensor memory layout provided that the operation behavior is maintained. |
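The linear packed layout can be sketched as a row-major flattening of the index (this mirrors the idea of the tensor access helpers; the function name is illustrative):

[source,c++]
----
#include <cassert>
#include <cstddef>
#include <vector>

// Linear packed layout: dimension (r-1) is contiguous, then (r-2), and so on
size_t linear_index(const std::vector<size_t>& shape,
                    const std::vector<size_t>& index) {
    size_t offset = 0;
    for (size_t d = 0; d < shape.size(); ++d) {
        offset = offset * shape[d] + index[d];
    }
    return offset;
}

int main() {
    // NHWC example: shape [2,2,3,4], element [1,0,2,3]
    // offset = ((1*2 + 0)*3 + 2)*4 + 3 = 35
    assert(linear_index({2, 2, 3, 4}, {1, 0, 2, 3}) == 35);
    // Incrementing the last (channel) dimension moves by one element
    assert(linear_index({2, 2, 3, 4}, {1, 0, 2, 2}) == 34);
    return 0;
}
----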
| |
| .Data Layouts |
| [cols="1,4,4"] |
| |=== |
| |Name|Description of dimensions|Usage |
| |
| |NHWC|Batch, Height, Width, Channels|Feature maps |
| |NDHWC|Batch, Depth, Height, Width, Channels|Feature maps for 3D convolution |
| |OHWI|Output channels, Filter Height, Filter Width, Input channels|Weights |
| |HWIM|Filter Height, Filter Width, Input channels, Channel Multiplier|Weights for depthwise convolutions |
| |DOHWI|Depth, Output Channels, Filter Height, Filter Width, Input Channels|Weights for 3D convolution |
| |=== |
| |
| ==== Broadcasting |
| |
| In operations where broadcasting is supported, an input shape dimension can be broadcast to an output shape dimension if the input shape dimension is 1. |
| TOSA broadcast requires the rank of both tensors to be the same. |
| A RESHAPE can be done to create a compatible tensor with appropriate dimensions of size 1. |
| To map indexes in an output tensor to that of an input tensor, see <<Broadcast Helper>>. |
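The index mapping can be sketched as follows (this mirrors the idea of the <<Broadcast Helper>>; names are illustrative): along any dimension where the input shape is 1, the input index is always 0.

[source,c++]
----
#include <cassert>
#include <cstddef>
#include <vector>

// Map an output index to an input index, reading element 0 along any
// dimension that the input broadcasts (input shape of 1); both tensors
// are assumed to have the same rank
std::vector<size_t> apply_broadcast(const std::vector<size_t>& in_shape,
                                    const std::vector<size_t>& out_index) {
    std::vector<size_t> in_index(out_index);
    for (size_t d = 0; d < in_shape.size(); ++d) {
        if (in_shape[d] == 1) in_index[d] = 0;
    }
    return in_index;
}

int main() {
    // Input shape [1,3] broadcast against an output of shape [4,3]:
    // output element [2,1] reads input element [0,1]
    std::vector<size_t> mapped = apply_broadcast({1, 3}, {2, 1});
    assert(mapped[0] == 0 && mapped[1] == 1);
    return 0;
}
----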
| |
| ==== Supported Number Formats |
| |
| The following number formats are defined in TOSA. |
| The number formats supported by a given operator are listed in its table of supported types. |
| |
| .Number formats |
| [cols="1,1,1,5"] |
| |=== |
| |Format|Minimum|Maximum|Description |
| |
| |bool_t |
| | - |
| | - |
| |Boolean value that is either `true` or `false`. Size implementation defined. The TOSA reference model implements this as int8_t with 0 for `false` and 1 for `true`. All non-zero values are accepted on input as `true`. |
| |
| |i4_t |
| | - |
| | - |
|Signless 4-bit integer type. Will be interpreted as int4_t by all operators.
| |
| |int4_t |
| | -7 |
| | +7 |
|Signed 4-bit two's-complement value. Excludes -8 to maintain a range symmetric about zero for weights.
| |
| |i8_t |
| | - |
| | - |
| |Signless 8-bit integer value. Will be interpreted as int8_t unless otherwise specified by an operator. |
| |
| |int8_t |
| | -128 |
| | +127 |
| |Signed 8-bit two's-complement value. |
| |
| |uint8_t |
| | 0 |
| | 255 |
| |Unsigned 8-bit integer value. |
| |
| |i16_t |
| | - |
| | - |
| |Signless 16-bit integer type. Will be interpreted as int16_t unless otherwise specified by an operator. |
| |
| |int16_t |
| | -32768 |
| | +32767 |
| |Signed 16-bit two's-complement value. |
| |
| |uint16_t |
| | 0 |
| | 65535 |
|Unsigned 16-bit integer value.
| |
| |i32_t |
| | - |
| | - |
| |Signless 32-bit integer value. Will be interpreted as int32_t by all operators. |
| |
| |int32_t |
| | -(1<<31) |
| | (1<<31)-1 |
| |Signed 32-bit two's-complement value. |
| |
| |i48_t |
| | - |
| | - |
| |Signless 48-bit integer value. Will be interpreted as int48_t by all operators. |
| |
| |int48_t |
| | -(1<<47) |
| | (1<<47)-1 |
| |Signed 48-bit two's-complement value. |
| |
| |fp16_t |
| | -infinity |
| | +infinity |
| | 16-bit half-precision floating-point defined by <<Other publications>>[1]. + |
| Normal values must be supported. + |
| Denormal values must either be supported or flushed to zero. + |
| Positive and negative infinity must be supported. + |
| At least one NaN encoding must be supported. + |
| Signed zero must be supported. |
| |
| |bf16_t |
| | -infinity |
| | +infinity |
| | 16-bit brain floating-point defined as bits [31:16] of the fp32_t format. + |
| Normal values must be supported. + |
| Denormal values must either be supported or flushed to zero. + |
| Positive and negative infinity must be supported. + |
| At least one NaN encoding must be supported. + |
| Signed zero must be supported. |
| |
| |fp32_t |
| | -infinity |
| | +infinity |
| | 32-bit single-precision floating-point defined by <<Other publications>>[1]. + |
| Normal values must be supported. + |
| Denormal values must either be supported or flushed to zero. + |
| Positive and negative infinity must be supported. + |
| At least one NaN encoding must be supported. + |
| Signed zero must be supported. |
| |
| |fp64_t |
| | -infinity |
| | + infinity |
| | 64-bit double-precision floating-point defined by <<Other publications>>[1]. + |
| Normal values must be supported. + |
| Denormal values must either be supported or flushed to zero. + |
| Positive and negative infinity must be supported. + |
| At least one NaN encoding must be supported. + |
| Signed zero must be supported. |
| |=== |
| |
| Note: In this specification minimum<type> and maximum<type> will denote the minimum and maximum values of the data as stored in memory (ignoring the zero point). |
The minimum and maximum values for each type are given in the preceding table.
| |
| Note: Integer number formats smaller than 8 bits may be used provided that the numerical result is the same as using a sequence of 8-bit TOSA operations. |
For example, a convolution with low precision data must equal that of running the convolution at 8 bits and then clipping the result to the permitted output range.
| This ensures that a Base Inference profile TOSA implementation can calculate the same result. |
| |
| === Integer Behavior |
| |
| TOSA integer inputs and outputs are specified by signless values with the given number of bits. |
Unless otherwise specified, these values will be interpreted as signed two's-complement.
| The pseudocode will use int*_t to indicate use as a signed value and uint*_t to indicate use as an unsigned value. |
If overflow occurs during an integer calculation, the result is unpredictable, as indicated by the REQUIRE checks in the pseudocode for the operators.
| |
| Unsigned 8 and 16-bit values are only allowed in the RESCALE operation, to allow for compatibility with networks which expect unsigned 8-bit or 16-bit tensors for input and output. |
| |
| ==== Quantization |
| |
| Machine Learning frameworks may represent tensors with a quantized implementation, using integer values to represent the original floating-point numbers. |
| TOSA integer operations do not perform any implicit scaling to represent quantized values. |
| Required zero point values are passed to the operator as necessary, and will be processed according to the pseudocode for each operator. |
| |
| To convert a network containing quantized tensors to TOSA, generate explicit RESCALE operators for any change of quantization scaling. |
| This reduces quantized operations to purely integer operations. |
| |
| As an example, an ADD between two quantized tensors requires the integer values represent the same range. |
| The scale arguments for RESCALE can be calculated to ensure that the resulting tensors represent the same range. |
| Then the ADD is performed, and a RESCALE can be used to ensure that the result is scaled properly. |
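The idea can be sketched with an integer multiply-and-shift rescale (this is a simplified stand-in for illustration, not the RESCALE operator pseudocode; the values and scales are illustrative):

[source,c++]
----
#include <cassert>
#include <cstdint>

// Simplified rescale: value * multiplier * 2^-shift with round-to-nearest
int32_t rescale(int32_t value, int32_t multiplier, int8_t shift) {
    int64_t round = int64_t(1) << (shift - 1);
    return static_cast<int32_t>((int64_t(value) * multiplier + round) >> shift);
}

int main() {
    // input1 uses quantization scale 0.5, input2 uses 0.25; rescale input1
    // by 2 (multiplier = 2<<14, shift = 14) so both count units of 0.25
    int32_t a = rescale(10, 2 << 14, 14);  // 10 * 0.5 == 20 * 0.25
    int32_t b = 12;                        // 12 * 0.25
    assert(a == 20);
    assert(a + b == 32);  // the sum represents 8.0 at scale 0.25
    return 0;
}
----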
| |
| RESCALE provides support for per-tensor and per-channel scaling values to ensure compatibility with a range of possible quantization implementations. |
| |
| |
| |
| ==== Precision scaling |
| |
| TOSA uses the RESCALE operation to scale between values with differing precision. |
| The RESCALE operator is defined using an integer multiply, add, and shift. |
| This guarantees that all TOSA implementations will return the same result for a RESCALE, including those with no support for floating-point numbers. |
| |
| This TOSA specification supports two precisions of multiplier: 16-bit and 32-bit. |
| The 32-bit multiplier version supports two rounding modes to enable simpler lowering of existing frameworks that use two stage rounding. |
| All arithmetic is designed so that it does not overflow a 64-bit accumulator and that the final result fits in 32 bits. |
| In particular a 48-bit value can only be scaled with the 16-bit multiplier. |
| |
| The apply_scale functions provide a scaling of approximately (multiplier * 2^-shift^). |
| The shift and value range is limited to allow a variety of implementations. |
| The limit of 62 on shift allows the shift to be decomposed as two right shifts of 31. |
| The limit on value allows implementations that left shift the value before the multiply in the case of shifts of 32 or less. |
| For example, in the case shift=30 an implementation of the form ((value\<<2) * multiplier + round)>>32 can be used. |
| A scaling range of 2^+12^ down to 2^-32^ is supported for both functions with a normalized multiplier. |
| |
| For example, in typical usage a scaling of m*2^-n^ where m is a fraction in the |
| range 1.0 \<= m < 2.0 can be represented using multiplier=(1<<30)*m, shift=(30+n) for |
| apply_scale_32() and multiplier=(1<<14)*m, shift=(14+n) for apply_scale_16(). |
| The values to achieve a scaling of 1.0 are shift=30, multiplier=1<<30 for apply_scale_32 and shift=14, multiplier=1<<14 for apply_scale_16. |
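| 
| Such a decomposition can be computed mechanically. The sketch below (an illustration, not part of the specification) uses frexp to normalize an arbitrary positive scale factor into the multiplier and shift form described above:
| 
| [source,c++]
| ----
| #include <cassert>
| #include <cmath>
| #include <cstdint>
| 
| // Decompose scale into (multiplier, shift) for apply_scale_32 so that
| // multiplier * 2^-shift ~= scale, with 1<<30 <= multiplier < 1<<31.
| void compute_multiplier_shift(double scale, int32_t *multiplier, int8_t *shift) {
|     assert(scale > 0.0);
|     int exponent;
|     double m = std::frexp(scale, &exponent);   // scale = m * 2^exponent, 0.5 <= m < 1.0
|     int64_t q = std::llround(m * (1LL << 31)); // 1<<30 <= q <= 1<<31
|     if (q == (1LL << 31)) { q >>= 1; exponent++; } // rounding reached 2^31
|     *multiplier = static_cast<int32_t>(q);
|     *shift = static_cast<int8_t>(31 - exponent);
| }
| ----
| 
| For a scale of 1.0 this yields multiplier=1<<30 and shift=30. Note that the apply_scale_32 range check (2 \<= shift \<= 62) still applies to the result, so very large or very small scales remain out of range.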
| |
| [source,c++] |
| ---- |
| int32_t apply_scale_32(int32_t value, int32_t multiplier, int8_t shift, bool_t double_round=false) { |
| REQUIRE(multiplier >= 0); |
| REQUIRE(2 <= shift && shift <= 62); |
| REQUIRE(value >= -(static_cast<int64_t>(1) << (shift - 1)) && value < (static_cast<int64_t>(1) << (shift - 1)));
| int64_t round = static_cast<int64_t>(1) << (shift - 1);
| if (double_round) { |
| if (shift > 31 && value >= 0) round += 1<<30; |
| if (shift > 31 && value < 0) round -= 1<<30; |
| } |
| int64_t result = static_cast<int64_t>(value) * multiplier + round; |
| result = result >> shift; |
| // result will fit in a 32-bit range due to the REQUIRE on value
| return static_cast<int32_t>(result); |
| } |
| |
| int32_t apply_scale_16(int48_t value, int16_t multiplier, int8_t shift) {
| REQUIRE(multiplier >= 0); |
| REQUIRE(2 <= shift && shift <= 62); |
| int64_t round = static_cast<int64_t>(1) << (shift - 1);
| int64_t result = static_cast<int64_t>(value) * multiplier + round; |
| result = result >> shift; |
| REQUIRE(result >= minimum<int32_t> && result <= maximum<int32_t>); |
| return static_cast<int32_t>(result); |
| } |
| ---- |
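| 
| The identity values above can be checked with a standalone restatement of apply_scale_32 (REQUIRE replaced by assert; an illustrative sketch, not the normative pseudocode):
| 
| [source,c++]
| ----
| #include <cassert>
| #include <cstdint>
| 
| // Standalone version of apply_scale_32 for experimentation.
| int32_t apply_scale_32(int32_t value, int32_t multiplier, int8_t shift,
|                        bool double_round = false) {
|     assert(multiplier >= 0);
|     assert(2 <= shift && shift <= 62);
|     int64_t round = static_cast<int64_t>(1) << (shift - 1);
|     if (double_round) {
|         if (shift > 31 && value >= 0) round += 1 << 30;
|         if (shift > 31 && value < 0)  round -= 1 << 30;
|     }
|     int64_t result = static_cast<int64_t>(value) * multiplier + round;
|     return static_cast<int32_t>(result >> shift);
| }
| ----
| 
| With multiplier=1<<30 and shift=30 the function returns its input unchanged; with shift=31 it halves the input with round-to-nearest.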
| |
| In some functions, the multiplier and shift are combined into a scale_t structure: |
| |
| [source,c++] |
| ---- |
| typedef struct { |
| int32_t multiplier; |
| int8_t shift; |
| } scale_t; |
| ---- |
| |
| In places where a divide is required, we also use the function below to calculate an appropriate scaling value. |
| |
| [source,c++] |
| ---- |
| scale_t reciprocal_scale(uint32_t value) { |
| REQUIRE(value > 0); |
| scale_t scale; |
| int32_t k = 32 - count_leading_zeros(value - 1); // (1 << k) / 2 < value <= (1 << k) |
| int64_t numerator = (static_cast<int64_t>(1 << 30) + 1) << k;
| scale.multiplier = numerator / value; // (1 << 30) <= multiplier < (1 << 31) |
| scale.shift = 30 + k; |
| return scale; |
| } |
| ---- |
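| 
| Combining reciprocal_scale with the multiply-and-shift of apply_scale_32 approximates an integer division with rounding. The sketch below inlines both, writing count_leading_zeros as the GCC/Clang builtin __builtin_clz (a toolchain assumption, not part of the specification):
| 
| [source,c++]
| ----
| #include <cassert>
| #include <cstdint>
| 
| // Approximate value / divisor using reciprocal_scale-style parameters.
| int32_t divide_by_scale(int32_t value, uint32_t divisor) {
|     assert(divisor > 0);
|     if (divisor == 1) return value;               // avoid __builtin_clz(0)
|     int32_t k = 32 - __builtin_clz(divisor - 1);  // (1 << k)/2 < divisor <= (1 << k)
|     int64_t numerator = ((static_cast<int64_t>(1) << 30) + 1) << k;
|     int32_t multiplier = static_cast<int32_t>(numerator / divisor);
|     int8_t shift = static_cast<int8_t>(30 + k);
|     int64_t round = static_cast<int64_t>(1) << (shift - 1);
|     return static_cast<int32_t>((static_cast<int64_t>(value) * multiplier + round) >> shift);
| }
| ----
| 
| For example, divide_by_scale(300, 7) returns 43, the round-to-nearest result of 300/7.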
| |
| ==== Integer Convolutions |
| |
| For the convolution operators, the input is not required to be scaled. |
| The integer versions of the convolution operators will subtract the zero point from the integer values as defined for each operator. |
| The convolution produces an accumulator output of type int32_t or int48_t. |
| This accumulator output is then scaled to the final output range using the RESCALE operator. |
| The scale applied in the RESCALE operator should be set to multiplier and shift values such that: multiplier * 2^-shift^ = (input_scale * weight_scale) / output_scale.
| Here, input_scale, weight_scale, and output_scale are the conversion factors from integer to floating-point for the input, weight, and output tensor values respectively.
| If per-channel scaling is needed then the per-channel option of the RESCALE operation should be used. |
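| 
| As a sketch (the helper below and its use of frexp are an illustration, not part of the specification), the RESCALE parameters for a convolution can be derived from the three conversion factors:
| 
| [source,c++]
| ----
| #include <cassert>
| #include <cmath>
| #include <cstdint>
| 
| // Derive (multiplier, shift) such that
| // multiplier * 2^-shift ~= (input_scale * weight_scale) / output_scale.
| void conv_rescale_params(double input_scale, double weight_scale, double output_scale,
|                          int32_t *multiplier, int8_t *shift) {
|     double scale = (input_scale * weight_scale) / output_scale;
|     assert(scale > 0.0);
|     int exponent;
|     double m = std::frexp(scale, &exponent);   // scale = m * 2^exponent
|     int64_t q = std::llround(m * (1LL << 31)); // normalize to [1<<30, 1<<31)
|     if (q == (1LL << 31)) { q >>= 1; exponent++; }
|     *multiplier = static_cast<int32_t>(q);
|     *shift = static_cast<int8_t>(31 - exponent);
| }
| ----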
| |
| ==== Integer Elementwise Operators |
| |
| When two quantized tensors are used in an operation, they must represent the same numeric range for the result to be valid. |
| In this case, TOSA expects that RESCALE operators will be used as necessary to generate 32-bit integer values in a common range. |
| There are many valid choices for scale factors and options for the common range. |
| TOSA does not impose a requirement on which scale factors and range should be used. |
| Compilers generating TOSA sequences should choose a range that allows the operation to be computed without overflow, while allowing the highest possible accuracy of the output. |
| |
| ==== General Unary Functions
| 
| For integer inputs, general unary functions such as sigmoid(), tanh(), and exp() are expressed using a lookup table and interpolation to enable efficient implementation.
| This also allows for other operations with the addition of user-supplied tables (the TABLE operation). |
| All table lookups are based on the following reference lookup function that takes as input a table of 513 entries of 16 bits each. |
| |
| [source,c++] |
| ---- |
| int32_t apply_lookup_s(int16_t *table, int32_t value) |
| { |
| int16_t clipped_value = static_cast<int16_t>(apply_clip_s<int32_t>(value, -32768, +32767)); |
| int32_t index = (clipped_value + 32768) >> 7; |
| int32_t fraction = clipped_value & 0x7f; |
| int16_t base = table[index]; |
| int16_t next = table[index+1]; |
| int32_t slope = next - base; |
| REQUIRE(slope >= minimum<int16_t> && slope <= maximum<int16_t>);
| int32_t return_value = (base << 7) + slope * fraction; |
| return return_value; // return interpolated value of 16 + 7 = 23 bits |
| } |
| ---- |
| |
| Note that although the table lookup defined here has 16-bit precision, for 8-bit only operations an 8-bit table can be derived by applying the reference function to each of the possible 256 input values. |
| The following code constructs a 513-entry table based on a reference function. |
| |
| [source,c++] |
| ---- |
| void generate_lookup_table(int16_t *table, int32_t (*reference)(int32_t)) |
| { |
| for (int i = -256; i <= 256; i++) { |
| int32_t value = (*reference)(i); |
| table[i + 256] = static_cast<int16_t>(apply_clip_s<int32_t>(value, -32768, +32767)); |
| } |
| } |
| ---- |
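| 
| The two functions can be exercised together. The sketch below is a standalone restatement (REQUIRE replaced by assert, apply_clip_s inlined) using a hypothetical identity reference function f(i) = 128 * i, for which interpolation reproduces value \<< 7 exactly for inputs away from the clipped top endpoint:
| 
| [source,c++]
| ----
| #include <cassert>
| #include <cstdint>
| 
| static int16_t clip16(int32_t v) {
|     return static_cast<int16_t>(v < -32768 ? -32768 : (v > 32767 ? 32767 : v));
| }
| 
| // Interpolating lookup, as in apply_lookup_s above.
| static int32_t lookup(const int16_t *table, int32_t value) {
|     int16_t clipped_value = clip16(value);
|     int32_t index = (clipped_value + 32768) >> 7;
|     int32_t fraction = clipped_value & 0x7f;
|     int32_t base = table[index];
|     int32_t slope = table[index + 1] - base;
|     return (base << 7) + slope * fraction;    // 23-bit interpolated result
| }
| 
| static int32_t identity_ref(int32_t i) { return 128 * i; } // [-256,256] -> 16-bit range
| 
| int32_t demo_lookup(int32_t value) {
|     int16_t table[513];
|     for (int i = -256; i <= 256; i++) {       // as in generate_lookup_table
|         table[i + 256] = clip16(identity_ref(i));
|     }
|     return lookup(table, value);
| }
| ----
| 
| demo_lookup(1000) returns 128000, i.e. 1000 \<< 7, showing how the base and slope terms recombine into the interpolated 23-bit value.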
| |
| === Other publications |
| |
| The following publications are referred to in this specification, or provide more information: |
| |
| . IEEE Std 754-2008, _IEEE Standard for Floating-point Arithmetic_, August 2008. |