HALF-PRECISION FLOATING-POINT LIBRARY (Version 2.2.0)
-----------------------------------------------------

This is a C++ header-only library to provide an IEEE 754 conformant 16-bit
half-precision floating-point type along with corresponding arithmetic
operators, type conversions and common mathematical functions. It aims for both
efficiency and ease of use, trying to accurately mimic the behaviour of the
built-in floating-point types at the best performance possible.


INSTALLATION AND REQUIREMENTS
-----------------------------

Conveniently, the library consists of just a single header file containing all
the functionality, which can be directly included by your projects, without the
necessity to build anything or link to anything.

Whereas this library is fully C++98-compatible, it can profit from certain
C++11 features. Support for those features is checked automatically at compile
(or rather preprocessing) time, but can be explicitly enabled or disabled by
predefining the corresponding preprocessor symbols to either 1 or 0 yourself
before including half.hpp. This is useful when the automatic detection fails
(for more exotic implementations) or when a feature should be explicitly
disabled:

- 'long long' integer type for mathematical functions returning 'long long'
  results (enabled for VC++ 2003 and icc 11.1 and newer, gcc and clang,
  overridable with 'HALF_ENABLE_CPP11_LONG_LONG').

- Static assertions for extended compile-time checks (enabled for VC++ 2010,
  gcc 4.3, clang 2.9, icc 11.1 and newer, overridable with
  'HALF_ENABLE_CPP11_STATIC_ASSERT').

- Generalized constant expressions (enabled for VC++ 2015, gcc 4.6, clang 3.1,
  icc 14.0 and newer, overridable with 'HALF_ENABLE_CPP11_CONSTEXPR').

- noexcept exception specifications (enabled for VC++ 2015, gcc 4.6,
  clang 3.0, icc 14.0 and newer, overridable with 'HALF_ENABLE_CPP11_NOEXCEPT').

- User-defined literals for half-precision literals to work (enabled for
  VC++ 2015, gcc 4.7, clang 3.1, icc 15.0 and newer, overridable with
  'HALF_ENABLE_CPP11_USER_LITERALS').

- Thread-local storage for per-thread floating-point exception flags (enabled
  for VC++ 2015, gcc 4.8, clang 3.3, icc 15.0 and newer, overridable with
  'HALF_ENABLE_CPP11_THREAD_LOCAL').

- Type traits and template meta-programming features from <type_traits>
  (enabled for VC++ 2010, libstdc++ 4.3, libc++ and newer, overridable with
  'HALF_ENABLE_CPP11_TYPE_TRAITS').

- Special integer types from <cstdint> (enabled for VC++ 2010, libstdc++ 4.3,
  libc++ and newer, overridable with 'HALF_ENABLE_CPP11_CSTDINT').

- Certain C++11 single-precision mathematical functions from <cmath> for
  floating-point classification during conversions from higher precision types
  (enabled for VC++ 2013, libstdc++ 4.3, libc++ and newer, overridable with
  'HALF_ENABLE_CPP11_CMATH').

- Floating-point environment control from <cfenv> for possible exception
  propagation to the built-in floating-point platform (enabled for VC++ 2013,
  libstdc++ 4.3, libc++ and newer, overridable with 'HALF_ENABLE_CPP11_CFENV').

- Hash functor 'std::hash' from <functional> (enabled for VC++ 2010,
  libstdc++ 4.3, libc++ and newer, overridable with 'HALF_ENABLE_CPP11_HASH').

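As a minimal sketch of the override mechanism, the following fragment forces
two of the features listed above on or off before the header is included. The
choice of symbols is purely illustrative, and the include path is an assumption
that depends on where half.hpp lives in your project.

```cpp
// The overrides must be defined before the first inclusion of half.hpp.
#define HALF_ENABLE_CPP11_CONSTEXPR 0      // force-disable constexpr support
#define HALF_ENABLE_CPP11_USER_LITERALS 1  // force-enable user-defined literals

#include "half.hpp"  // assumed include path
```

Defining the same symbols on the compiler command line should have the same
effect, as long as they are visible before the header is processed.
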
The library has been tested successfully with Visual C++ 2005-2015, gcc 4-8
and clang 3-8 on 32- and 64-bit x86 systems. Please contact me if you have any
problems, suggestions or even just success testing it on other platforms.


DOCUMENTATION
-------------

What follows are some general words about the usage of the library and its
implementation. For complete documentation of its interface, consult the
corresponding website http://half.sourceforge.net. You may also generate the
complete developer documentation from the doxygen comments in the library's
single header file, but this is more relevant to developers than to mere users.

BASIC USAGE

To make use of the library just include its only header file half.hpp, which
defines all half-precision functionality inside the 'half_float' namespace. The
actual 16-bit half-precision data type is represented by the 'half' type, which
uses the standard IEEE representation with 1 sign bit, 5 exponent bits and 11
mantissa bits (including the hidden bit) and supports all types of special
values, like subnormal values, infinity and NaNs. This type behaves like the
built-in floating-point types as much as possible, supporting the usual
arithmetic, comparison and streaming operators, which makes its use pretty
straightforward:

    using half_float::half;
    half a(3.4), b(5);
    half c = a * b;
    c += 3;
    if(c > a)
        std::cout << c << std::endl;

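To make the bit layout mentioned above concrete, here is a small sketch that
decodes a raw 16-bit pattern by hand. 'decode_binary16' is a hypothetical
helper written for illustration only; it is not part of half.hpp and says
nothing about how the library itself performs conversions.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper, not part of half.hpp: decodes a raw IEEE 754 binary16
// bit pattern (1 sign bit, 5 exponent bits, 10 stored mantissa bits) to float.
inline float decode_binary16(std::uint16_t bits)
{
    const int sign     = (bits >> 15) & 0x1;
    const int exponent = (bits >> 10) & 0x1F;
    const int mantissa = bits & 0x3FF;

    float magnitude;
    if (exponent == 0) {
        // Zero or subnormal: no hidden bit, value is mantissa * 2^-24.
        magnitude = std::ldexp(static_cast<float>(mantissa), -24);
    } else if (exponent == 31) {
        // All exponent bits set: infinity (mantissa == 0) or NaN.
        magnitude = mantissa ? std::numeric_limits<float>::quiet_NaN()
                             : std::numeric_limits<float>::infinity();
    } else {
        // Normal: restore the hidden bit (0x400), remove the bias of 15 and
        // scale by the 10 stored mantissa bits.
        magnitude = std::ldexp(static_cast<float>(mantissa | 0x400),
                               exponent - 25);
    }
    return sign ? -magnitude : magnitude;
}
```

Under this layout 0x3C00 decodes to 1.0, 0xC000 to -2.0 and 0x7BFF to 65504,
the largest finite half-precision value.
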
Additionally the 'half_float' namespace also defines half-precision versions
for all mathematical functions of the C++ standard library, which can be used
directly through ADL:

    half a(-3.14159);
    half s = sin(abs(a));
    long l = lround(s);

You may also specify explicit half-precision literals, since the library
provides a user-defined literal inside the 'half_float::literal' namespace,
which you just need to import (assuming support for C++11 user-defined literals):

    using namespace half_float::literal;
    half x = 1.0_h;

Furthermore the library provides proper specializations for
'std::numeric_limits', defining various implementation properties, and
'std::hash' for hashing half-precision numbers (assuming support for C++11
'std::hash'). Similar to the corresponding preprocessor symbols from <cmath>
the library also defines the 'HUGE_VALH' constant and, where applicable, the
'FP_FAST_FMAH' symbol.

CONVERSIONS AND ROUNDING

The half is explicitly constructible/convertible from a single-precision float
argument. Thus it is also explicitly constructible/convertible from any type
implicitly convertible to float, but constructing it from types like double or
int will involve the usual warnings arising when implicitly converting those to
float because of the lost precision. On the one hand those warnings are
intentional, because converting those types to half necessarily also reduces
precision. But on the other hand they are raised for explicit conversions from
those types, when the user knows what they are doing. So if those warnings
bother you, you will have to explicitly convert to float first before
converting to half, or use the 'half_cast' described below. In addition
you can also directly assign float values to halfs.

In contrast to the float-to-half conversion, which reduces precision, the
conversion from half to float (and thus to any other type implicitly
convertible from float) is implicit, because all values representable with
half-precision are also representable with single-precision. This way the
half-to-float conversion behaves similarly to the built-in float-to-double
conversion and all arithmetic expressions involving both half-precision and
single-precision arguments will be of single-precision type. This way you can
also directly use the mathematical functions of the C++ standard library,
though in this case you will invoke the single-precision versions which will
also return single-precision values, which is (even if maybe performing the
exact same computation, see below) not as conceptually clean when working in a
half-precision environment.

The default rounding mode for conversions between half and more precise types
as well as for rounding results of arithmetic operations and mathematical
functions rounds to the nearest representable value. But by predefining the
'HALF_ROUND_STYLE' preprocessor symbol this default can be overridden with one
of the other standard rounding modes using their respective constants or the
equivalent values of 'std::float_round_style' (it can even be synchronized with
the built-in single-precision implementation by defining it to
'std::numeric_limits<float>::round_style'):

- 'std::round_indeterminate' (-1) for the fastest rounding.

- 'std::round_toward_zero' (0) for rounding toward zero.

- 'std::round_to_nearest' (1) for rounding to the nearest value (default).

- 'std::round_toward_infinity' (2) for rounding toward positive infinity.

- 'std::round_toward_neg_infinity' (3) for rounding toward negative infinity.

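A minimal sketch of changing the default, using the numeric value listed above
for rounding toward zero. The include path is an assumption; the symbol and
its values are the ones described in this section.

```cpp
// Select rounding toward zero as the library-wide default. The symbol has to
// be defined before the first inclusion of half.hpp.
#define HALF_ROUND_STYLE 0  // same value as std::round_toward_zero

#include "half.hpp"  // assumed include path
```
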
In addition to changing the overall default rounding mode one can also use the
'half_cast'. This converts between half and any built-in arithmetic type using
a configurable rounding mode (or the default rounding mode if none is
specified). Besides the configurable rounding mode, 'half_cast' differs from a
mere 'static_cast' in another important way: all conversions are performed
directly using the given rounding mode, without any intermediate conversion
to/from 'float'. This is especially relevant for conversions to integer types,
which don't necessarily truncate anymore. But also for conversions from
'double' or 'long double' this may produce more precise results than a
pre-conversion to 'float' using the single-precision implementation's current
rounding mode would.

    half a = half_cast<half>(4.2);
    half b = half_cast<half,std::numeric_limits<float>::round_style>(4.2f);
    assert( half_cast<int, std::round_to_nearest>( 0.7_h ) == 1 );
    assert( half_cast<half,std::round_toward_zero>( 4097 ) == 4096.0_h );
    assert( half_cast<half,std::round_toward_infinity>( 4097 ) == 4100.0_h );
    assert( half_cast<half,std::round_toward_infinity>( std::numeric_limits<double>::min() ) > 0.0_h );

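The integer results above follow from half-precision having 11 significant
bits: between 4096 and 8192 only every fourth integer is representable. The
following sketch reproduces that arithmetic for non-negative integers.
'round_to_half_grid' is a hypothetical helper, not a library function; it
assumes that ties go to the even neighbour in round-to-nearest mode and it
ignores overflow beyond the largest finite half value.

```cpp
#include <limits>

// Hypothetical helper, not part of half.hpp: rounds a non-negative integer to
// the nearest value that fits into 11 significant bits, using the given
// rounding style.
inline unsigned long round_to_half_grid(unsigned long value,
                                        std::float_round_style style)
{
    // Spacing of representable values: the smallest power of two for which
    // value / step needs at most 11 bits (i.e. is below 2048).
    unsigned long step = 1;
    while (value / step >= 2048)
        step *= 2;

    const unsigned long rem  = value % step;
    const unsigned long down = value - rem;
    const unsigned long up   = down + step;
    if (rem == 0)
        return value;  // already representable

    switch (style) {
    case std::round_toward_infinity:
        return up;
    case std::round_to_nearest:
        if (rem * 2 > step) return up;
        if (rem * 2 < step) return down;
        return ((down / step) % 2 == 0) ? down : up;  // tie: assumed to even
    default:
        // Toward zero, toward negative infinity and indeterminate all round
        // down here, since the input is non-negative.
        return down;
    }
}
```

For 4097 the spacing is 4, so rounding toward zero yields 4096 and rounding
toward positive infinity yields 4100, matching the assertions above.
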
ACCURACY AND PERFORMANCE

From version 2.0 onward the library is implemented without employing the
underlying floating-point implementation of the system (except for conversions,
of course), providing an entirely self-contained half-precision implementation
with results independent from the system's existing single- or double-precision
implementation and its rounding behaviour.

As to accuracy, many of the operators and functions provided by this library
are exact to rounding for all rounding modes, i.e. the error to the exact
result is at most 0.5 ULP (unit in the last place) for rounding to nearest and
less than 1 ULP for all other rounding modes. This holds for all the operations
required by the IEEE 754 standard and many more. Specifically the following
functions might exhibit a deviation from the correctly rounded exact result by
1 ULP for a select few input values: 'expm1', 'log1p', 'pow', 'atan2', 'erf',
'erfc', 'lgamma', 'tgamma' (for more details see the documentation of the
individual functions). All other functions and operators are always exact to
rounding or independent of the rounding mode altogether.

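To get a feel for what these error bounds mean in absolute terms, the sketch
below computes the size of one ULP at a given magnitude, assuming the 11-bit
significand described earlier. 'binary16_ulp' is a hypothetical helper, not
part of the library, and it only handles finite inputs within the
half-precision range.

```cpp
#include <cmath>

// Hypothetical helper, not part of half.hpp: spacing between neighbouring
// half-precision values at the magnitude of x (finite, within half range).
inline float binary16_ulp(float x)
{
    x = std::fabs(x);

    // Zero and subnormal values share the fixed spacing 2^-24.
    if (x < std::ldexp(1.0f, -14))
        return std::ldexp(1.0f, -24);

    // frexp yields x = m * 2^e with m in [0.5, 1), so floor(log2(x)) = e - 1.
    int e = 0;
    std::frexp(x, &e);

    // 10 stored mantissa bits: one ULP is 2^(floor(log2(x)) - 10).
    return std::ldexp(1.0f, e - 11);
}
```

One ULP is 2^-10 near 1.0 and 4 near 4097, so a result that is exact to
rounding in round-to-nearest mode is off by at most 2 in the latter range.
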
The increased IEEE-conformance and cleanliness of this implementation comes
with a certain performance cost compared to doing computations and mathematical
functions in hardware-accelerated single-precision. On average and depending on
the platform, the arithmetic operators are about 75% as fast and the
mathematical functions about 33-50% as fast as performing the corresponding
operations in single-precision and converting between the inputs and outputs.
However, directly computing with half-precision values is a rather rare
use-case and usually using actual 'float' values for all computations and
temporaries and using 'half's only for storage is the recommended way.
Nevertheless, the goal of this library was to provide a complete and
conceptually clean IEEE-conformant half-precision implementation, and in the
few cases when you do need to compute directly in half-precision you do so for
a reason and want accurate results.

If necessary, this internal implementation can be overridden by predefining the
'HALF_ARITHMETIC_TYPE' preprocessor symbol to one of the built-in
floating-point types ('float', 'double' or 'long double'), which will cause the
library to use this type for computing arithmetic operations and mathematical
functions (if available). However, due to using the platform's floating-point
implementation (and its rounding behaviour) internally, this might cause
results to deviate from the specified half-precision rounding mode. It will of
course also inhibit the automatic exception detection described below.

The conversion operations between half-precision and single-precision types can
also make use of the F16C extension for x86 processors by using the
corresponding compiler intrinsics from <immintrin.h>. Support for this is
checked at compile-time by looking for the '__F16C__' macro which at least gcc
and clang define based on the target platform. It can also be enabled manually
by predefining the 'HALF_ENABLE_F16C_INTRINSICS' preprocessor symbol to 1, or 0
for explicitly disabling it. However, this will directly use the corresponding
intrinsics for conversion without checking if they are available at runtime
(possibly crashing if they are not), so make sure they are supported on the
target platform before enabling this.

EXCEPTION HANDLING

The half-precision implementation supports all 5 required floating-point
exceptions from the IEEE standard to indicate erroneous inputs or inexact
results during operations. These are represented by exception flags which
actually use the same values as the corresponding 'FE_...' flags defined in
C++11's <cfenv> header if supported, specifically:

- 'FE_INVALID' for invalid inputs to an operation.
- 'FE_DIVBYZERO' for finite inputs producing infinite results.
- 'FE_OVERFLOW' if a result is too large to represent finitely.
- 'FE_UNDERFLOW' for a subnormal or zero result after rounding.
- 'FE_INEXACT' if a result needed rounding to be representable.
- 'FE_ALL_EXCEPT' as a convenient OR of all possible exception flags.

The internal exception flag state will start with all flags cleared and is
maintained per thread if C++11 thread-local storage is supported, otherwise it
will be maintained globally and will theoretically NOT be thread-safe (while
practically being as thread-safe as a simple integer variable can be). These
flags can be managed explicitly using the library's error handling functions,
which again try to mimic the built-in functions for handling floating-point
exceptions from <cfenv>. You can clear them with 'feclearexcept' (which is the
only way a flag can be cleared), test them with 'fetestexcept', explicitly
raise errors with 'feraiseexcept' and save and restore their state using
'fegetexceptflag' and 'fesetexceptflag'. You can also throw corresponding C++
exceptions based on the current flag state using 'fethrowexcept'.

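Since the library's flag functions are modelled on <cfenv>, the usual
clear/operate/test pattern carries over. The sketch below shows that pattern
with the standard <cfenv> functions only, so it compiles without half.hpp;
'overflow_flag_roundtrip' is a hypothetical name chosen for this example. With
the library, the same-named functions described above would presumably be
called in the same order.

```cpp
#include <cfenv>

// Hypothetical helper: raises, tests and clears FE_OVERFLOW using the
// standard <cfenv> functions. Returns true if the flag was observed as set
// after raising it and as cleared after clearing it.
inline bool overflow_flag_roundtrip()
{
    std::feclearexcept(FE_ALL_EXCEPT);   // start from a clean flag state
    std::feraiseexcept(FE_OVERFLOW);     // stand-in for an overflowing operation

    const bool was_set = std::fetestexcept(FE_OVERFLOW) != 0;

    std::feclearexcept(FE_OVERFLOW);     // flags stay set until cleared
    const bool is_clear = std::fetestexcept(FE_OVERFLOW) == 0;

    return was_set && is_clear;
}
```

The key point is that flags are sticky: they accumulate across operations and
only disappear when cleared explicitly.
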
However, any automatic exception detection and handling during half-precision
operations and functions is DISABLED by default, since it comes with a minor
performance overhead due to runtime checks, and reacting to IEEE floating-point
exceptions is rarely ever needed in application code. But the library fully
supports IEEE-conformant detection of floating-point exceptions and various
ways for handling them, which can be enabled by pre-defining the corresponding
preprocessor symbols to 1. They can be enabled individually or all at once and
they will be processed in the order they are listed here:

- 'HALF_ERRHANDLING_FLAGS' sets the internal exception flags described above
  whenever the corresponding exception occurs.
- 'HALF_ERRHANDLING_ERRNO' sets the value of 'errno' from <cerrno> similar to
  the behaviour of the built-in floating-point types when 'MATH_ERRNO' is used.
- 'HALF_ERRHANDLING_FENV' will propagate exceptions to the built-in
  floating-point implementation using 'std::feraiseexcept' if support for
  C++11 floating-point control is enabled. However, this does not synchronize
  exceptions: neither will clearing propagate nor will it work in reverse.
- 'HALF_ERRHANDLING_THROW_...' can be defined to a string literal which will
  be used as description message for a C++ exception that is thrown whenever
  a 'FE_...' exception occurs, similar to the behaviour of 'fethrowexcept'.

If any of the above error handling is activated, non-quiet operations on
half-precision values will also raise a 'FE_INVALID' exception whenever
they encounter a signaling NaN value, in addition to transforming the value
into a quiet NaN. If error handling is disabled, signaling NaNs will be
treated like quiet NaNs (while still getting explicitly quieted if propagated
to the result). There can also be additional treatment of overflow and
underflow errors after they have been processed as above, which is ENABLED by
default (but of course only takes effect if any other exception handling is
activated) unless overridden by pre-defining the corresponding preprocessor
symbol to 0:

- 'HALF_ERRHANDLING_OVERFLOW_TO_INEXACT' will cause overflow errors to also
  raise a 'FE_INEXACT' exception.
- 'HALF_ERRHANDLING_UNDERFLOW_TO_INEXACT' will cause underflow errors to also
  raise a 'FE_INEXACT' exception. This will also slightly change the
  behaviour of the underflow exception, which will ONLY be raised if the
  result is actually inexact due to underflow. If this is disabled, underflow
  exceptions will be raised for ANY (possibly exact) subnormal result.


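As a minimal sketch, the fragment below enables two of the handlers listed
above and turns off the underflow-to-inexact treatment. The combination is
only an example, and the include path is an assumption.

```cpp
// All symbols must be defined before the first inclusion of half.hpp.
#define HALF_ERRHANDLING_FLAGS 1                 // record exceptions in the internal flags
#define HALF_ERRHANDLING_ERRNO 1                 // additionally set errno
#define HALF_ERRHANDLING_UNDERFLOW_TO_INEXACT 0  // raise underflow for any subnormal result

#include "half.hpp"  // assumed include path
```
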
CREDITS AND CONTACT
-------------------

This library is developed by CHRISTIAN RAU and released under the MIT License
(see LICENSE.txt). If you have any questions or problems with it, feel free to
contact me at rauy@users.sourceforge.net.

Additional credit goes to JEROEN VAN DER ZIJP for his paper on "Fast Half Float
Conversions", whose algorithms have been used in the library for converting
between half-precision and single-precision values.