| // |
| // This confidential and proprietary software may be used only as |
| // authorised by a licensing agreement from ARM Limited |
| // (C) COPYRIGHT 2020-2024 ARM Limited |
| // ALL RIGHTS RESERVED |
| // The entire notice above must be reproduced on all authorised |
| // copies and copies may only be made to the extent permitted |
| // by a licensing agreement from ARM Limited. |
| |
| == Introduction |
| |
| === Overview |
| |
| Tensor Operator Set Architecture (TOSA) provides a set of whole-tensor |
| operations commonly employed by Deep Neural Networks. The intent is to enable a |
| variety of implementations running on a diverse range of processors, with the |
| results at the TOSA level consistent across those implementations. Applications |
| or frameworks which target TOSA can therefore be deployed on a wide range of |
| different processors, such as SIMD CPUs, GPUs and custom hardware such as |
| NPUs/TPUs, with defined accuracy and compatibility constraints. Most operators |
| from the common ML frameworks (TensorFlow, PyTorch, etc.) should be expressible |
| in TOSA. It is expected that there will be tools to lower from ML frameworks |
| into TOSA. |
| |
| === Goals |
| |
| The goals of TOSA include the following: |
| |
| * A minimal and stable set of tensor-level operators to which machine learning |
| framework operators can be reduced. |
| |
| * Full support for both quantized integer and floating-point content. |
| |
* Precise functional description of the behavior of every operator, including
their numerical behavior with respect to precision, saturation, scaling, and
range as required by quantized datatypes.
| |
| * Agnostic to any single high-level framework, compiler backend stack or |
| particular target. |
| |
| * The detailed functional and numerical description enables precise code |
| construction for a diverse range of targets – SIMD CPUs, GPUs and custom |
| hardware such as NPUs/TPUs. |
| |
| === Specification |
| |
| The TOSA Specification is written as AsciiDoc mark-up and developed in its raw |
| mark-up form, managed through a git repository here: |
| https://git.mlplatform.org/tosa/specification.git/. |
| The specification is developed and versioned much like software. |
| While the mark-up is legible and can be read fairly easily in its raw form, it is recommended to build or “render” the mark-up into PDF or HTML. |
| To do this, please follow the instructions in the README.md in the root of the specification repository. |
| |
| === Operator Selection Principles |
| |
| TOSA defines a set of primitive operators to which higher level operators can be lowered in a consistent way. |
| To remain effective and efficient to implement, the set of operators must be constrained to a reasonably small set of primitive operations out of which others can be constructed. |
| The following principles govern the selection of operators within TOSA. |
| |
| .Principles |
| [cols="1,5,5"] |
| |=== |
| |ID|Principle|Reason for this |
| |
| |P0 |
| |An operator shall be a primitive operation or building block that cannot be decomposed into simpler whole tensor operations. |
| |If the operator can be broken down, then we should look at the component operators. |
| |
| |P1 |
|An operator shall be usable as a component out of which more complex operations can be constructed.
|Single-use operators have a high architectural cost, and a more reusable version should be considered instead.
| |
| |P2 |
| |Precision should be appropriate for the input and output data types. |
| |Precision higher than that needed to calculate the result leads to extra implementation cost. |
| |
| |P3 |
| |Numerical definition of common sub-operations should be consistent between operators (for example: value scaling). |
| |Consistent sub-operation definition reduces the operator implementation cost. |
| |
| |P4 |
| |The valid input and output ranges for all arguments shall be specified. |
| |Ranges are required to make consistent (numerically agreeing) implementations possible. |
| |
| |P5 |
| |Integer operators shall be implementable in a bit-exact form with good efficiency on CPU, GPU and hardware targets. |
| |Reduces implementation cost and gives consistent inference results. |
| |=== |
| |
| === Profiles |
| |
| TOSA supports three profiles that enable efficient implementation on different classes of device. |
| The Base Inference profile is intended for embedded integer/fixed-point designs performing inference only. |
| The Main Inference profile is intended for general inference functionality including integer and floating-point data types. |
| The Main Training profile adds training operators in addition to inference operators. |
| This version of the specification covers the Base Inference and Main Inference profiles. |
The Main Training profile is expected in a later version of the specification.
| The following table summarizes the three profiles: |
| |
| .Profiles |
| |=== |
| |Profile|Name|Integer Inference|Floating-point Inference|Training |
| |
| |Base Inference|TOSA-BI|Yes|No|No |
| |Main Inference|TOSA-MI|Yes|Yes|No |
| |Main Training|TOSA-MT|Yes|Yes|Yes |
| |=== |
| |
| === Levels |
| |
| A TOSA level defines operator argument ranges that an implementation shall support. |
| This is distinct from a profile that defines the operations and data-types supported. |
| This version of the specification defines two TOSA levels: |
| |
* No level: allows the full range of arguments specified by the operations according to the operation data types.
* Level 8K: ranges are expected to be sufficient for applications with frame sizes up to 8K.
| |
| Later versions of the specification may define additional levels. |
The following table defines the value ranges for Level 8K.
These ranges are checked using the LEVEL_CHECK() function within the operator descriptions.
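As an illustrative sketch only (the member name MAX_KERNEL and the value 8192 are assumptions to be checked against the generated level table below), a level check in an operator description might take the following form:

[source,c++]
----
#include <cassert>
#include <cstdint>

// Hypothetical level structure; MAX_KERNEL is an assumed member name and
// 8192 an assumed Level 8K value - check both against the level table
struct tosa_level_t { int32_t MAX_KERNEL; };

// Sketch: report whether an argument range condition holds for the level
bool LEVEL_CHECK(bool condition) { return condition; }

int main() {
    tosa_level_t level_8k = {8192};
    int32_t kernel_y = 1024;
    // An operator description checks its argument ranges against the level
    assert(LEVEL_CHECK(kernel_y <= level_8k.MAX_KERNEL));
    return 0;
}
----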
| |
| .Level maximums |
| include::{generated}/levels.adoc[] |
| |
| === Status |
| |
| The TOSA specification is a work in progress. |
| |
| * The Base Inference profile should be considered to be near release quality, with conformance tests available. |
| * The Main Inference profile has most of the expected operators in place, but is still subject to change. |
| * The reference model and conformance tests do not yet support all of the floating point types that have been defined. |
| * There is not currently a conformance test suite available for Main Inference. |
* The Main Training profile is pre-alpha; significant work remains on the profile, and no conformance tests are available.
| |
| === Compliance |
| |
| This section defines when a TOSA implementation is compliant to a given TOSA specification profile and level. |
| To be compliant an implementation must achieve the results and accuracy defined by this specification. |
| TOSA also defines a set of conformance tests. |
| A compliant implementation must pass the conformance tests. |
| The conformance tests are not exhaustive, so an implementation that passes the conformance tests may not be compliant if there is a non-compliance that is undetected by the tests. |
| |
| ==== Base Inference Profile Compliance |
| |
The <<Operator Graphs>> section of this specification defines a TOSA graph and its behavior.
| This behavior is captured in the pseudo-code function tosa_execute_graph(). |
| For a given input graph (with attributes) and input tensors there are three possible tosa_graph_result values after executing the graph: |
| |
| * tosa_unpredictable: The result of the graph on the given inputs cannot be relied upon. |
| * tosa_error: The graph does not meet the specification and is recognised as an illegal graph. |
| * tosa_valid: The result is defined and predictable and the list of output tensors defines the result. |
| |
An implementation is compliant to the TOSA Base Inference Profile if it matches the above results as follows:
| |
| * For tosa_unpredictable, the implementation can return whatever result it chooses (including error) |
| * For tosa_error, the implementation must return an error result (and there is no requirement on how much of the graph is executed, if any) |
| * For tosa_valid, the implementation must execute the entire graph without error and return the result defined by this specification. |
| |
In terms of pseudo-code, if *graph* is a TOSA graph consisting of Base Inference Profile operators and *input_list* is a list of input tensors then the following test must pass.
| |
| [source,c++] |
| ---- |
| bool tosa_test_compliance(tosa_graph_t graph, tosa_list_t input_list, tosa_level_t level) { |
| shape_list_t output_list_spec = tosa_allocate_list(tosa_output_shape(graph)); |
| shape_list_t output_list_test = tosa_allocate_list(tosa_output_shape(graph)); |
| tosa_graph_result = tosa_valid; // result starts as valid |
| tosa_nesting_depth = 0; // if/while nesting level |
| tosa_execute_graph(graph, input_list, output_list_spec, level); |
| if (tosa_graph_result == tosa_unpredictable) { |
| return true; // No requirement to match an unpredictable result |
| } |
| result_test = execute_implementation_under_test(graph, input_list, output_list_test); |
| if (tosa_graph_result == tosa_error) { |
| return result_test == tosa_error; // result must be an error |
| } |
| if (exact_tensor_match(output_list_spec, output_list_test)) { |
| // Predictable bit-exact value match required |
| return true; |
| } |
| return false; |
| } |
| ---- |
| |
| ==== Main Inference Profile Compliance |
| |
| A Main Inference compliant implementation must satisfy the following: |
| |
* The implementation must meet <<Base Inference Profile Compliance>> for all Base Inference compliant graphs
| * The implementation must support all Main Inference operations using the datatype fp32_t |
| ** The operations must meet the precision requirements of <<Main Inference precision requirements>> |
| * The implementation must support all Main Inference operations using the datatype fp16_t |
| ** The operations must meet the precision requirements of <<Main Inference precision requirements>> |
| ** Note: These requirements allow fp16_t operations to be implemented using the fp32_t datatype |
| * The implementation must support all Main Inference operations using the datatype bf16_t |
| ** The operations must meet the precision requirements of <<Main Inference precision requirements>> |
| ** Note: These requirements allow bf16_t operations to be implemented using the fp32_t datatype |
| |
As with <<Base Inference Profile Compliance>>, the pseudo-code function tosa_execute_graph() can return one of three possible results.
| A compliant implementation must satisfy the following: |
| |
| * For a graph returning tosa_error the implementation must also return an error |
| * For a graph returning tosa_valid the implementation must execute the entire graph without error |
| * For a graph returning tosa_valid and consisting only of integer operators the results must match exactly |
| |
| ===== Main Inference precision requirements |
| |
In a compliant implementation, individual floating-point operations within the graph must meet the accuracy bounds listed in the following table.
| In the table _ulp_ means unit of the last place. |
| The function tosa_reference_check_fp() defines the error range permitted by a given number of units of last place in this specification. |
| |
| NOTE: The error criteria in this section are at an early draft stage and are likely to change during conformance test development. |
| |
| The following criteria apply to all operations: |
| |
| * If any input is a NaN and the result is floating-point then the result must be a NaN |
| * If any input is a NaN and the operation is a comparison (greater, greater-equal, equal) then the result must be false |
* If any input is a NaN and the operation is conversion to an integer or boolean then the result is unpredictable
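As a minimal illustration of the _ulp_ measure (this is not the specification's tosa_reference_check_fp() definition), the size of one ulp of an fp32_t value can be computed as the gap to the next representable value:

[source,c++]
----
#include <cassert>
#include <cmath>
#include <limits>

// One ulp of x: gap between x and the next representable float above it
float ulp(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

int main() {
    // For fp32_t values in [1.0, 2.0), one ulp is 2^-23
    assert(ulp(1.0f) == std::ldexp(1.0f, -23));
    // The ulp scales with the exponent: one ulp of 2.0 is 2^-22
    assert(ulp(2.0f) == std::ldexp(1.0f, -22));
    return 0;
}
----

An accuracy bound of, say, 0.5 ulp therefore permits an absolute error of at most half this gap at the magnitude of the result.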
| |
| [cols="1,3"] |
| |=== |
| | Operation | Accuracy bound |
| |
| | <<ARGMAX>>, <<MAX_POOL2D>>, <<CLAMP>>, <<MAXIMUM>>, <<MINIMUM>>, <<ABS>>, <<NEGATE>>, <<SELECT>>, <<REDUCE_MAX>>, <<REDUCE_MIN>>, <<CONST>>, <<IDENTITY>> |
| | Non NaN results must be exact. |
| |
| | <<EQUAL>>, <<GREATER>>, <<GREATER_EQUAL>> |
| | The result must be exact with: + |
| (1) The sign of the zero is ignored + |
| (2) Infinities of the same sign compare as equal |
| |
| | <<CONV2D>>, <<CONV3D>>, <<DEPTHWISE_CONV2D>>, <<FULLY_CONNECTED>>, <<MATMUL>>, <<TRANSPOSE_CONV2D>> |
| | Each output can be expressed as a dot product of two input vectors. + |
| The dot product must meet the <<Dot product accuracy requirements>> |
| |
| | <<FFT2D>>, <<RFFT2D>> |
| | Each output can be expressed as a dot product of an input vector with a constant coefficient vector. + |
| The dot product must meet the <<Dot product accuracy requirements>> |
| |
| | <<ADD>>, <<MUL>>, <<SUB>>, <<CEIL>>, <<FLOOR>> |
| | Floating-point result overflows must be set to infinity of the correct sign. + |
| Floating-point result underflows must be set to zero of the correct sign. + |
Addition of infinities of different signs must produce a NaN. +
| Subtraction of infinities of the same sign must produce a NaN. + |
| Multiplication of an infinity by a zero must produce a NaN. + |
| Otherwise the result must be within 0.5 ulp of the mathematical result. |
| |
| | <<CAST>> |
| | Floating-point result overflows must be set to infinity of the correct sign. + |
| Floating-point result underflows must be set to zero of the correct sign. + |
| Cast from floating-point to integer result overflows must be saturated. + |
| Cast from floating-point to integer must be rounded using round to nearest, ties to even, rounding mode. + |
| Otherwise cast to floating-point must be within 0.5 ulp of the mathematical result. |
| |
| | <<RECIPROCAL>> |
| If the input is a zero or the result overflows the output must be an infinity of the same sign. +
If the input is an infinity or the result underflows the output must be a zero of the same sign. +
Otherwise the result must be within 1 ulp of the mathematical result.
| |
| | <<RSQRT>> |
| | If the input is less than zero the result must be a NaN. + |
| Otherwise if the input is a zero the output must be an infinity of the same sign. + |
| Otherwise the result must be within 2 ulp of the mathematical result. |
| |
| | <<LOG>>, <<ERF>> |
| | If the input to LOG is less than zero then the result must be a NaN. + |
| If the result overflows the output must be an infinity of the correct sign. + |
| If the result underflows the output must be a zero of the correct sign. + |
| Otherwise the result must be within 5 ulp of the mathematical result. |
| |
| | <<EXP>> |
| | Let `x` be an input element and `out_imp` the implementation output of `exp(x)`. + |
| Let `out_ref` be the result of the fp64_t reference implementation of `exp(x)`. + |
| Let `err_bnd = abs(out_ref) * exp2(-normal_frac<in_out_t>) * (1+abs(x))` + |
| Then `tosa_reference_check_fp_bnd<in_out_t>(out_imp, out_ref, err_bnd)` must be true |
| |
| | <<POW>> |
| | Let `x`, `y` be input elements from `input1` and `input2` respectively. + |
| Let `out_imp` be the implementation output of `pow(x,y)`. + |
| If `x` is less than zero and `y` is non-integral then the result must be a NaN. + |
| Let `out_ref` be the result of the fp64_t reference implementation of `pow(x,y)`. + |
| Let `err_bnd = abs(out_ref) * exp2(-normal_frac<in_out_t>) * (1+abs(log(abs(x))*y))` + |
| Then `tosa_reference_check_fp_bnd<in_out_t>(out_imp, out_ref, err_bnd)` must be true |
| |
| | <<SIGMOID>> |
| | Let `x` be an input element and `out_imp` the implementation output. + |
| Let `out_ref` be the result of the fp64_t reference implementation. + |
| Let `err_bnd = abs(out_ref) * exp2(-normal_frac<in_out_t>) * (2 * (1+abs(x)))` + |
| Then `tosa_reference_check_fp_bnd<in_out_t>(out_imp, out_ref, err_bnd)` must be true |
| |
| | <<TANH>> |
| | Let `x` be an input element and `out_imp` the implementation output. + |
| Let `out_ref` be the result of the fp64_t reference implementation. + |
| Let `err_bnd = exp2(-normal_frac<in_out_t>) * max(0.5, abs(out_ref) * (4 * (1+abs(x))))` + |
| Then `tosa_reference_check_fp_bnd<in_out_t>(out_imp, out_ref, err_bnd)` must be true |
| |
| | <<REDUCE_SUM>> |
| | Each output can be expressed as a dot product of an input vector with a vector of ones. + |
| This dot product must meet the <<Dot product accuracy requirements>> |
| |
| | <<AVG_POOL2D>> |
| | Each output can be expressed as a dot product of an input vector with a vector with elements 1/KS where KS is the kernel size. + |
| This dot product must meet the <<Dot product accuracy requirements>> |
| |
| | <<REDUCE_PRODUCT>> |
| | Result overflows must be set to an infinity of the correct sign. + |
| Result underflows must be set to a zero of the correct sign. + |
Let `n` be the number of elements in the product, `out_imp` the implementation result, and `out_ref` the result of the fp64_t reference implementation. +
| Let `err_bnd = abs(out_ref) * (pow(1 + pow(2, -normal_frac<in_out_t> - 1), n) - 1)` + |
| Then `tosa_reference_check_fp_bnd<in_out_t>(out_imp, out_ref, err_bnd)` must be true |
| |
| |=== |
| |
| ===== Operator sequence precision requirement |
| |
| Precision criteria are specified for a single operator. |
| |
An implementation M of a sequence of n TOSA operators, A[0] to A[n-1], is said
to be compliant if M gives the same result as a sequence of implementations
M[0] to M[n-1] such that:
| |
* Each M[k] implements A[k] with the same or higher precision datatypes
| * Each M[k] meets the accuracy defined in this specification for A[k] where the M[k] output is converted to A[k] output precision using round to nearest |
| |
| ===== Dot product accuracy requirements |
| |
| This section assumes an operation acting on tensors named 'input', 'weight' and optionally 'bias'. |
| Each output tensor element can be expressed as a dot product of elements between the 'input' and 'weight' tensors with optional bias addition. |
| The dot product has length KS, the kernel size. |
| If the operation does not specify a bias then 'bias' is taken to be zero in this section. |
| Note: KS is defined for each relevant operator in the appendix section <<Main Inference operator test data>>. |
| |
| In other words, each output element `out` can be expressed as a dot product between input elements `in[k]`, weight elements `w[k]`, bias `b`: |
| |
| `out = in[0] * w[0] + in[1] * w[1] + ... + in[KS-1] * w[KS-1] + b` |
| |
The positions of `in[k]`, `w[k]`, `b` in the input, weight and bias tensors depend on the operation being performed.
| This may be, for example, a convolution. |
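As a concrete illustration of the expression above (the function and variable names here are illustrative, not part of the specification), one output element of a length-KS dot product with bias can be computed as:

[source,c++]
----
#include <cassert>
#include <cstddef>

// One output element as a dot product of input elements in[k] and weight
// elements w[k] over the kernel size KS, plus the bias b
float dot_product_output(const float* in, const float* w, size_t KS, float b) {
    float out = b;
    for (size_t k = 0; k < KS; ++k) {
        out += in[k] * w[k];
    }
    return out;
}

int main() {
    const float in[3] = {1.0f, 2.0f, 3.0f};
    const float w[3]  = {4.0f, 5.0f, 6.0f};
    // 1*4 + 2*5 + 3*6 + 1 = 33
    assert(dot_product_output(in, w, 3, 1.0f) == 33.0f);
    return 0;
}
----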
| |
| This section defines the accuracy required for these operations. |
| In this section: |
| |
| * "fp64 arithmetic" refers to double-precision floating-point arithmetic defined by IEEE 754 (<<Other publications>>[1]) |
| * `operation_fp64()` is an fp64 reference implementation of the operation |
| * `operation_imp()` is the implementation under test |
| * `local_bound` is defined as follows: |
| ** For operations with a local_bound attribute it is the value of the optional attribute, with default value of false |
| ** For operations that do not have a local_bound attribute the value is true |
| |
| The checks described in the following code must pass for the following data sets: |
| |
| * Data sets defined for the operation in Appendix A <<Main Inference operator test data>>. |
| * Data sets that have at least MIN_DOT_PRODUCT different output values. For these data sets we take S=-1. |
| |
| [source,c++] |
| ---- |
| output_ref = operation_fp64(input, weight, bias); |
| output_imp = operation_imp (input, weight, bias); |
| input_abs = abs(input); // Element-wise absolute |
| weight_abs = abs(weight); // Element-wise absolute |
| bias_abs = abs(bias); // Element-wise absolute |
| if (!local_bound) { |
| input_abs_max = max_value(input_abs); // maximum over all elements |
    for_each(index in shape(input_abs)) {
| input_abs[index] = input_abs_max; // set all entries to global maximum |
| } |
| } |
| output_bnd = operation_fp64(input_abs, weight_abs, bias_abs); |
| |
size_t T = tensor_size(output_shape);  // number of dot product results
| size_t ksb = (max_value(bias_abs) > 0) ? (KS + 1) : KS; // kernel size and bias |
| fp64_t out_err_sum = 0.0; |
| fp64_t out_err_sumsq = 0.0; |
| for_each(index in output_shape) { |
| fp64_t out_bnd = tensor_read<fp64_t>(output_bnd, output_shape, index); |
| fp64_t out_ref = tensor_read<fp64_t>(output_ref, output_shape, index); |
| acc_t out_imp = tensor_read<acc_t> (output_imp, output_shape, index); |
| fp64_t out_err; |
| if ((acc_t)out_bnd == infinity) { |
| // dot product can overflow and there is no accuracy limit |
| out_err = 0.0; |
| } else if (out_bnd == 0.0) { |
| REQUIRE(out_ref == 0.0 && out_imp == 0.0); |
| out_err = 0.0; |
| } else { // 0.0 < out_bnd < infinity |
| fp64_t out_err_bnd = max(out_bnd * exp2(-1-normal_frac<acc_t>()), normal_min<acc_t>()); |
| out_err = (static_cast<fp64_t>(out_imp) - out_ref) / out_err_bnd; |
| REQUIRE(abs(out_err) <= ksb); |
| } |
| out_err_sum += out_err; |
| out_err_sumsq += out_err * out_err; |
| } |
| if (input and weights are data set S with 3 <= S <= 5) { |
| // check output error bias magnitude for data sets S which are not positive biased |
| REQUIRE(abs(out_err_sum) <= 2*sqrt(ksb*T)); |
| } |
| // check output error variance magnitude |
REQUIRE(out_err_sumsq <= 0.4*ksb*T);
| ---- |
| |
| === Tensor Definitions |
| |
| ==== Tensors |
| |
| Tensors are multidimensional arrays of data. |
| Tensors have metadata associated with them that describe characteristics of the tensor, including: |
| |
| * Data Type |
| * Shape |
| |
| The number of dimensions in a shape is called the rank. |
| A tensor with rank equal to zero is permitted. |
| In that case, the tensor has a single entry and is also known as a scalar. |
| A tensor shape is an array of integers of size equal to the rank of the tensor. |
| Each element in the tensor shape describes the number of elements in the dimension. |
| The tensor shape in each dimension must be greater than or equal to 1. |
| For tensor access information, see <<Tensor Access Helpers>>. |
| |
| The shape of a tensor of non-zero rank is a special type shape_t. |
| shape_t is a one-dimensional list with the size equal to the rank of the original tensor. |
| The components of a shape_t are of type size_t. |
| |
| In this version of the specification, shape_t values must be resolvable to constants at backend compile time. |
| |
| ==== Tensor size limit |
| |
| The tensor overall size is limited by the data type size_t. |
| This type must be able to hold integers in the range 0 to (1 << (MAX_LOG2_SIZE + 1)) - 1 where MAX_LOG2_SIZE is defined in <<Levels>>. |
For each tensor, the number of tensor elements multiplied by the element size in bytes (which is taken to be 1 for elements smaller than 8 bits) must be less than or equal to (1 << (MAX_LOG2_SIZE + 1)) - 1.
| |
| The size of tensors along each of their dimensions is limited by the data type size_t. |
| |
| This means that the maximum size of a tensor along each dimension is (1 << MAX_LOG2_SIZE) - 1 and therefore the maximum coordinate value is (1 << MAX_LOG2_SIZE) - 2. |
| Indices used to access tensors must be non-negative. |
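To make these limits concrete (the value of MAX_LOG2_SIZE below is a placeholder for illustration only; the actual value is defined per level in <<Levels>>):

[source,c++]
----
#include <cassert>
#include <cstdint>

int main() {
    // Placeholder only - the real MAX_LOG2_SIZE comes from the level table
    const uint64_t MAX_LOG2_SIZE = 31;
    const uint64_t max_total_bytes = (uint64_t(1) << (MAX_LOG2_SIZE + 1)) - 1;
    const uint64_t max_dim_size    = (uint64_t(1) << MAX_LOG2_SIZE) - 1;
    // The maximum coordinate is one less than the maximum dimension size
    assert(max_dim_size - 1 == (uint64_t(1) << MAX_LOG2_SIZE) - 2);
    // Example: 2^20 fp32 elements occupy 2^22 bytes, within the overall limit
    assert(uint64_t(1 << 20) * 4 <= max_total_bytes);
    return 0;
}
----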
| |
| |
| ==== Data Layouts |
| |
| The following data layouts are supported in TOSA. |
| TOSA operations are defined in terms of a linear packed tensor layout. |
In a linear packed layout, a rank r tensor stores the elements of dimension (r-1) consecutively.
The next dimension to increment is dimension (r-2), and so on.
| For a specification of this layout see the tensor read and write functions in section <<Tensor Access Helpers>>. |
| |
| An implementation of TOSA can choose a different tensor memory layout provided that the operation behavior is maintained. |
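The linear packed layout can be sketched as a row-major flattening of the index (this mirrors the idea of the tensor access helpers; the function name is illustrative):

[source,c++]
----
#include <cassert>
#include <cstddef>
#include <vector>

// Linear packed layout: dimension (r-1) is contiguous, then (r-2), and so on
size_t linear_index(const std::vector<size_t>& shape,
                    const std::vector<size_t>& index) {
    size_t offset = 0;
    for (size_t d = 0; d < shape.size(); ++d) {
        offset = offset * shape[d] + index[d];
    }
    return offset;
}

int main() {
    // NHWC example: shape [2,2,3,4], element [1,0,2,3]
    // offset = ((1*2 + 0)*3 + 2)*4 + 3 = 35
    assert(linear_index({2, 2, 3, 4}, {1, 0, 2, 3}) == 35);
    // Incrementing the last (channel) dimension moves by one element
    assert(linear_index({2, 2, 3, 4}, {1, 0, 2, 2}) == 34);
    return 0;
}
----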
| |
| .Data Layouts |
| [cols="1,4,4"] |
| |=== |
| |Name|Description of dimensions|Usage |
| |
| |NHWC|Batch, Height, Width, Channels|Feature maps |
| |NDHWC|Batch, Depth, Height, Width, Channels|Feature maps for 3D convolution |
| |OHWI|Output channels, Filter Height, Filter Width, Input channels|Weights |
| |HWIM|Filter Height, Filter Width, Input channels, Channel Multiplier|Weights for depthwise convolutions |
| |DOHWI|Depth, Output Channels, Filter Height, Filter Width, Input Channels|Weights for 3D convolution |
| |=== |
| |
| ==== Broadcasting |
| |
| In operations where broadcasting is supported, an input shape dimension can be broadcast to an output shape dimension if the input shape dimension is 1. |
| TOSA broadcast requires the rank of both tensors to be the same. |
| A RESHAPE can be done to create a compatible tensor with appropriate dimensions of size 1. |
| To map indexes in an output tensor to that of an input tensor, see <<Broadcast Helper>>. |
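The index mapping can be sketched as follows (this mirrors the idea of the <<Broadcast Helper>>; names are illustrative): along any dimension where the input shape is 1, the input index is always 0.

[source,c++]
----
#include <cassert>
#include <cstddef>
#include <vector>

// Map an output index to an input index, reading element 0 along any
// dimension that the input broadcasts (input shape of 1); both tensors
// are assumed to have the same rank
std::vector<size_t> apply_broadcast(const std::vector<size_t>& in_shape,
                                    const std::vector<size_t>& out_index) {
    std::vector<size_t> in_index(out_index);
    for (size_t d = 0; d < in_shape.size(); ++d) {
        if (in_shape[d] == 1) in_index[d] = 0;
    }
    return in_index;
}

int main() {
    // Input shape [1,3] broadcast against an output of shape [4,3]:
    // output element [2,1] reads input element [0,1]
    std::vector<size_t> mapped = apply_broadcast({1, 3}, {2, 1});
    assert(mapped[0] == 0 && mapped[1] == 1);
    return 0;
}
----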
| |
| ==== Supported Number Formats |
| |
| The following number formats are defined in TOSA. |
| The number formats supported by a given operator are listed in its table of supported types. |
| |
| .Number formats |
| [cols="1,1,1,5"] |
| |=== |
| |Format|Minimum|Maximum|Description |
| |
| |bool_t |
| | - |
| | - |
| |Boolean value that is either `true` or `false`. Size implementation defined. The TOSA reference model implements this as int8_t with 0 for `false` and 1 for `true`. All non-zero values are accepted on input as `true`. |
| |
| |i4_t |
| | - |
| | - |
|Signless 4-bit integer type. Will be interpreted as int4_t by all operators.
| |
| |int4_t |
| | -7 |
| | +7 |
|Signed 4-bit two's-complement value. Excludes -8 to maintain a range symmetric about zero for weights.
| |
| |i8_t |
| | - |
| | - |
| |Signless 8-bit integer value. Will be interpreted as int8_t unless otherwise specified by an operator. |
| |
| |int8_t |
| | -128 |
| | +127 |
| |Signed 8-bit two's-complement value. |
| |
| |uint8_t |
| | 0 |
| | 255 |
| |Unsigned 8-bit integer value. |
| |
| |i16_t |
| | - |
| | - |
| |Signless 16-bit integer type. Will be interpreted as int16_t unless otherwise specified by an operator. |
| |
| |int16_t |
| | -32768 |
| | +32767 |
| |Signed 16-bit two's-complement value. |
| |
| |uint16_t |
| | 0 |
| | 65535 |
|Unsigned 16-bit integer value.
| |
| |i32_t |
| | - |
| | - |
| |Signless 32-bit integer value. Will be interpreted as int32_t by all operators. |
| |
| |int32_t |
| | -(1<<31) |
| | (1<<31)-1 |
| |Signed 32-bit two's-complement value. |
| |
| |i48_t |
| | - |
| | - |
| |Signless 48-bit integer value. Will be interpreted as int48_t by all operators. |
| |
| |int48_t |
| | -(1<<47) |
| | (1<<47)-1 |
| |Signed 48-bit two's-complement value. |
| |
| |fp16_t |
| | -infinity |
| | +infinity |
| | 16-bit half-precision floating-point defined by <<Other publications>>[1]. + |
| Normal values must be supported. + |
| Denormal values must either be supported or flushed to zero. + |
| Positive and negative infinity must be supported. + |
| At least one NaN encoding must be supported. + |
| Signed zero must be supported. |
| |
| |bf16_t |
| | -infinity |
| | +infinity |
| | 16-bit brain floating-point defined as bits [31:16] of the fp32_t format. + |
| Normal values must be supported. + |
| Denormal values must either be supported or flushed to zero. + |
| Positive and negative infinity must be supported. + |
| At least one NaN encoding must be supported. + |
| Signed zero must be supported. |
| |
| |fp32_t |
| | -infinity |
| | +infinity |
| | 32-bit single-precision floating-point defined by <<Other publications>>[1]. + |
| Normal values must be supported. + |
| Denormal values must either be supported or flushed to zero. + |
| Positive and negative infinity must be supported. + |
| At least one NaN encoding must be supported. + |
| Signed zero must be supported. |
| |
| |fp64_t |
| | -infinity |
| | + infinity |
| | 64-bit double-precision floating-point defined by <<Other publications>>[1]. + |
| Normal values must be supported. + |
| Denormal values must either be supported or flushed to zero. + |
| Positive and negative infinity must be supported. + |
| At least one NaN encoding must be supported. + |
| Signed zero must be supported. |
| |=== |
| |
| Note: In this specification minimum<type> and maximum<type> will denote the minimum and maximum values of the data as stored in memory (ignoring the zero point). |
The minimum and maximum values for each type are given in the preceding table.
| |
| Note: Integer number formats smaller than 8 bits may be used provided that the numerical result is the same as using a sequence of 8-bit TOSA operations. |
For example, a convolution with low precision data must equal that of running the convolution at 8 bits and then clipping the result to the permitted output range.
| This ensures that a Base Inference profile TOSA implementation can calculate the same result. |
| |
| === Integer Behavior |
| |
| TOSA integer inputs and outputs are specified by signless values with the given number of bits. |
Unless otherwise specified, these values will be interpreted as signed two's-complement.
| The pseudocode will use int*_t to indicate use as a signed value and uint*_t to indicate use as an unsigned value. |
If overflow occurs during an integer calculation, the result is unpredictable, as indicated by the REQUIRE checks in the pseudocode for the operators.
| |
| Unsigned 8 and 16-bit values are only allowed in the RESCALE operation, to allow for compatibility with networks which expect unsigned 8-bit or 16-bit tensors for input and output. |
| |
| ==== Quantization |
| |
| Machine Learning frameworks may represent tensors with a quantized implementation, using integer values to represent the original floating-point numbers. |
| TOSA integer operations do not perform any implicit scaling to represent quantized values. |
| Required zero point values are passed to the operator as necessary, and will be processed according to the pseudocode for each operator. |
| |
| To convert a network containing quantized tensors to TOSA, generate explicit RESCALE operators for any change of quantization scaling. |
| This reduces quantized operations to purely integer operations. |
| |
| As an example, an ADD between two quantized tensors requires the integer values represent the same range. |
| The scale arguments for RESCALE can be calculated to ensure that the resulting tensors represent the same range. |
| Then the ADD is performed, and a RESCALE can be used to ensure that the result is scaled properly. |
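The idea can be sketched with an integer multiply-and-shift rescale (this is a simplified stand-in for illustration, not the RESCALE operator pseudocode; the values and scales are illustrative):

[source,c++]
----
#include <cassert>
#include <cstdint>

// Simplified rescale: value * multiplier * 2^-shift with round-to-nearest
int32_t rescale(int32_t value, int32_t multiplier, int8_t shift) {
    int64_t round = int64_t(1) << (shift - 1);
    return static_cast<int32_t>((int64_t(value) * multiplier + round) >> shift);
}

int main() {
    // input1 uses quantization scale 0.5, input2 uses 0.25; rescale input1
    // by 2 (multiplier = 2<<14, shift = 14) so both count units of 0.25
    int32_t a = rescale(10, 2 << 14, 14);  // 10 * 0.5 == 20 * 0.25
    int32_t b = 12;                        // 12 * 0.25
    assert(a == 20);
    assert(a + b == 32);  // the sum represents 8.0 at scale 0.25
    return 0;
}
----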
| |
| RESCALE provides support for per-tensor and per-channel scaling values to ensure compatibility with a range of possible quantization implementations. |
| |
| |
| |
| ==== Precision scaling |
| |
| TOSA uses the RESCALE operation to scale between values with differing precision. |
| The RESCALE operator is defined using an integer multiply, add, and shift. |
| This guarantees that all TOSA implementations will return the same result for a RESCALE, including those with no support for floating-point numbers. |
| |
| This TOSA specification supports two precisions of multiplier: 16-bit and 32-bit. |
| The 32-bit multiplier version supports two rounding modes to enable simpler lowering of existing frameworks that use two stage rounding. |
| All arithmetic is designed so that it does not overflow a 64-bit accumulator and that the final result fits in 32 bits. |
| In particular a 48-bit value can only be scaled with the 16-bit multiplier. |
| |
| The apply_scale functions provide a scaling of approximately (multiplier * 2^-shift^). |
| The shift and value range is limited to allow a variety of implementations. |
| The limit of 62 on shift allows the shift to be decomposed as two right shifts of 31. |
| The limit on value allows implementations that left shift the value before the multiply in the case of shifts of 32 or less. |
| For example, in the case shift=30 an implementation of the form ((value\<<2) * multiplier + round)>>32 can be used. |
| A scaling range of 2^+12^ down to 2^-32^ is supported for both functions with a normalized multiplier. |
| |
| For example, in typical usage a scaling of m*2^-n^ where m is a fraction in the |
| range 1.0 \<= m < 2.0 can be represented using multiplier=(1<<30)*m, shift=(30+n) for |
| apply_scale_32() and multiplier=(1<<14)*m, shift=(14+n) for apply_scale_16(). |
| The values to achieve a scaling of 1.0 are shift=30, multiplier=1<<30 for apply_scale_32 and shift=14, multiplier=1<<14 for apply_scale_16. |
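| 
| Such a decomposition can be computed mechanically. The sketch below (an illustration, not part of the specification) uses frexp to normalize an arbitrary positive scale factor into the multiplier and shift form described above:
| 
| [source,c++]
| ----
| #include <cassert>
| #include <cmath>
| #include <cstdint>
| 
| // Decompose scale into (multiplier, shift) for apply_scale_32 so that
| // multiplier * 2^-shift ~= scale, with 1<<30 <= multiplier < 1<<31.
| void compute_multiplier_shift(double scale, int32_t *multiplier, int8_t *shift) {
|     assert(scale > 0.0);
|     int exponent;
|     double m = std::frexp(scale, &exponent);   // scale = m * 2^exponent, 0.5 <= m < 1.0
|     int64_t q = std::llround(m * (1LL << 31)); // 1<<30 <= q <= 1<<31
|     if (q == (1LL << 31)) { q >>= 1; exponent++; } // rounding reached 2^31
|     *multiplier = static_cast<int32_t>(q);
|     *shift = static_cast<int8_t>(31 - exponent);
| }
| ----
| 
| For a scale of 1.0 this yields multiplier=1<<30 and shift=30. Note that the apply_scale_32 range check (2 \<= shift \<= 62) still applies to the result, so very large or very small scales remain out of range.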
| |
| [source,c++] |
| ---- |
| int32_t apply_scale_32(int32_t value, int32_t multiplier, int8_t shift, bool_t double_round=false) { |
| REQUIRE(multiplier >= 0); |
| REQUIRE(2 <= shift && shift <= 62); |
| REQUIRE(value >= -(static_cast<int64_t>(1) << (shift - 1)) && value < (static_cast<int64_t>(1) << (shift - 1)));
| int64_t round = static_cast<int64_t>(1) << (shift - 1);
| if (double_round) { |
| if (shift > 31 && value >= 0) round += 1<<30; |
| if (shift > 31 && value < 0) round -= 1<<30; |
| } |
| int64_t result = static_cast<int64_t>(value) * multiplier + round; |
| result = result >> shift; |
| // result will fit in a 32-bit range due to the REQUIRE on value
| return static_cast<int32_t>(result); |
| } |
| |
| int32_t apply_scale_16(int48_t value, int16_t multiplier, int8_t shift) {
| REQUIRE(multiplier >= 0); |
| REQUIRE(2 <= shift && shift <= 62); |
| int64_t round = static_cast<int64_t>(1) << (shift - 1);
| int64_t result = static_cast<int64_t>(value) * multiplier + round; |
| result = result >> shift; |
| REQUIRE(result >= minimum<int32_t> && result <= maximum<int32_t>); |
| return static_cast<int32_t>(result); |
| } |
| ---- |
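| 
| The identity values above can be checked with a standalone restatement of apply_scale_32 (REQUIRE replaced by assert; an illustrative sketch, not the normative pseudocode):
| 
| [source,c++]
| ----
| #include <cassert>
| #include <cstdint>
| 
| // Standalone version of apply_scale_32 for experimentation.
| int32_t apply_scale_32(int32_t value, int32_t multiplier, int8_t shift,
|                        bool double_round = false) {
|     assert(multiplier >= 0);
|     assert(2 <= shift && shift <= 62);
|     int64_t round = static_cast<int64_t>(1) << (shift - 1);
|     if (double_round) {
|         if (shift > 31 && value >= 0) round += 1 << 30;
|         if (shift > 31 && value < 0)  round -= 1 << 30;
|     }
|     int64_t result = static_cast<int64_t>(value) * multiplier + round;
|     return static_cast<int32_t>(result >> shift);
| }
| ----
| 
| With multiplier=1<<30 and shift=30 the function returns its input unchanged; with shift=31 it halves the input with round-to-nearest.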
| |
| In some functions, the multiplier and shift are combined into a scale_t structure: |
| |
| [source,c++] |
| ---- |
| typedef struct { |
| int32_t multiplier; |
| int8_t shift; |
| } scale_t; |
| ---- |
| |
| In places where a divide is required, we also use the function below to calculate an appropriate scaling value. |
| |
| [source,c++] |
| ---- |
| scale_t reciprocal_scale(uint32_t value) { |
| REQUIRE(value > 0); |
| scale_t scale; |
| int32_t k = 32 - count_leading_zeros(value - 1); // (1 << k) / 2 < value <= (1 << k) |
| int64_t numerator = (static_cast<int64_t>(1 << 30) + 1) << k;
| scale.multiplier = numerator / value; // (1 << 30) <= multiplier < (1 << 31) |
| scale.shift = 30 + k; |
| return scale; |
| } |
| ---- |
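| 
| Combining reciprocal_scale with the multiply-and-shift of apply_scale_32 approximates an integer division with rounding. The sketch below inlines both, writing count_leading_zeros as the GCC/Clang builtin __builtin_clz (a toolchain assumption, not part of the specification):
| 
| [source,c++]
| ----
| #include <cassert>
| #include <cstdint>
| 
| // Approximate value / divisor using reciprocal_scale-style parameters.
| int32_t divide_by_scale(int32_t value, uint32_t divisor) {
|     assert(divisor > 0);
|     if (divisor == 1) return value;               // avoid __builtin_clz(0)
|     int32_t k = 32 - __builtin_clz(divisor - 1);  // (1 << k)/2 < divisor <= (1 << k)
|     int64_t numerator = ((static_cast<int64_t>(1) << 30) + 1) << k;
|     int32_t multiplier = static_cast<int32_t>(numerator / divisor);
|     int8_t shift = static_cast<int8_t>(30 + k);
|     int64_t round = static_cast<int64_t>(1) << (shift - 1);
|     return static_cast<int32_t>((static_cast<int64_t>(value) * multiplier + round) >> shift);
| }
| ----
| 
| For example, divide_by_scale(300, 7) returns 43, the round-to-nearest result of 300/7.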
| |
| ==== Integer Convolutions |
| |
| For the convolution operators, the input is not required to be scaled. |
| The integer versions of the convolution operators will subtract the zero point from the integer values as defined for each operator. |
| The convolution produces an accumulator output of type int32_t or int48_t. |
| This accumulator output is then scaled to the final output range using the RESCALE operator. |
| The scale applied in the RESCALE operator should be set to multiplier and shift values such that: multiplier * 2^-shift^ = (input_scale * weight_scale) / output_scale.
| Here, input_scale, weight_scale, and output_scale are the conversion factors from integer to floating-point for the input, weight, and output tensor values respectively.
| If per-channel scaling is needed then the per-channel option of the RESCALE operation should be used. |
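| 
| As a sketch (the helper below and its use of frexp are an illustration, not part of the specification), the RESCALE parameters for a convolution can be derived from the three conversion factors:
| 
| [source,c++]
| ----
| #include <cassert>
| #include <cmath>
| #include <cstdint>
| 
| // Derive (multiplier, shift) such that
| // multiplier * 2^-shift ~= (input_scale * weight_scale) / output_scale.
| void conv_rescale_params(double input_scale, double weight_scale, double output_scale,
|                          int32_t *multiplier, int8_t *shift) {
|     double scale = (input_scale * weight_scale) / output_scale;
|     assert(scale > 0.0);
|     int exponent;
|     double m = std::frexp(scale, &exponent);   // scale = m * 2^exponent
|     int64_t q = std::llround(m * (1LL << 31)); // normalize to [1<<30, 1<<31)
|     if (q == (1LL << 31)) { q >>= 1; exponent++; }
|     *multiplier = static_cast<int32_t>(q);
|     *shift = static_cast<int8_t>(31 - exponent);
| }
| ----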
| |
| ==== Integer Elementwise Operators |
| |
| When two quantized tensors are used in an operation, they must represent the same numeric range for the result to be valid. |
| In this case, TOSA expects that RESCALE operators will be used as necessary to generate 32-bit integer values in a common range. |
| There are many valid choices for scale factors and options for the common range. |
| TOSA does not impose a requirement on which scale factors and range should be used. |
| Compilers generating TOSA sequences should choose a range that allows the operation to be computed without overflow, while allowing the highest possible accuracy of the output. |
| |
| ==== General Unary Functions
| 
| For integer inputs, general unary functions such as sigmoid(), tanh(), and exp() are expressed using a lookup table and interpolation to enable efficient implementation.
| This also allows for other operations with the addition of user-supplied tables (the TABLE operation). |
| All table lookups are based on the following reference lookup function that takes as input a table of 513 entries of 16 bits each. |
| |
| [source,c++] |
| ---- |
| int32_t apply_lookup_s(int16_t *table, int32_t value) |
| { |
| int16_t clipped_value = static_cast<int16_t>(apply_clip_s<int32_t>(value, -32768, +32767)); |
| int32_t index = (clipped_value + 32768) >> 7; |
| int32_t fraction = clipped_value & 0x7f; |
| int16_t base = table[index]; |
| int16_t next = table[index+1]; |
| int32_t slope = next - base; |
| REQUIRE(slope >= minimum<int16_t> && slope <= maximum<int16_t>);
| int32_t return_value = (base << 7) + slope * fraction; |
| return return_value; // return interpolated value of 16 + 7 = 23 bits |
| } |
| ---- |
| |
| Note that although the table lookup defined here has 16-bit precision, for 8-bit only operations an 8-bit table can be derived by applying the reference function to each of the possible 256 input values. |
| The following code constructs a 513-entry table based on a reference function. |
| |
| [source,c++] |
| ---- |
| void generate_lookup_table(int16_t *table, int32_t (*reference)(int32_t)) |
| { |
| for (int i = -256; i <= 256; i++) { |
| int32_t value = (*reference)(i); |
| table[i + 256] = static_cast<int16_t>(apply_clip_s<int32_t>(value, -32768, +32767)); |
| } |
| } |
| ---- |
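| 
| The two functions can be exercised together. The sketch below is a standalone restatement (REQUIRE replaced by assert, apply_clip_s inlined) using a hypothetical identity reference function f(i) = 128 * i, for which interpolation reproduces value \<< 7 exactly for inputs away from the clipped top endpoint:
| 
| [source,c++]
| ----
| #include <cassert>
| #include <cstdint>
| 
| static int16_t clip16(int32_t v) {
|     return static_cast<int16_t>(v < -32768 ? -32768 : (v > 32767 ? 32767 : v));
| }
| 
| // Interpolating lookup, as in apply_lookup_s above.
| static int32_t lookup(const int16_t *table, int32_t value) {
|     int16_t clipped_value = clip16(value);
|     int32_t index = (clipped_value + 32768) >> 7;
|     int32_t fraction = clipped_value & 0x7f;
|     int32_t base = table[index];
|     int32_t slope = table[index + 1] - base;
|     return (base << 7) + slope * fraction;    // 23-bit interpolated result
| }
| 
| static int32_t identity_ref(int32_t i) { return 128 * i; } // [-256,256] -> 16-bit range
| 
| int32_t demo_lookup(int32_t value) {
|     int16_t table[513];
|     for (int i = -256; i <= 256; i++) {       // as in generate_lookup_table
|         table[i + 256] = clip16(identity_ref(i));
|     }
|     return lookup(table, value);
| }
| ----
| 
| demo_lookup(1000) returns 128000, i.e. 1000 \<< 7, showing how the base and slope terms recombine into the interpolated 23-bit value.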
| |
| === Other publications |
| |
| The following publications are referred to in this specification, or provide more information: |
| |
| . IEEE Std 754-2008, _IEEE Standard for Floating-point Arithmetic_, August 2008. |