HALF-PRECISION FLOATING POINT LIBRARY (Version 1.12.0)
------------------------------------------------------

This is a C++ header-only library to provide an IEEE 754 conformant 16-bit
half-precision floating point type along with corresponding arithmetic
operators, type conversions and common mathematical functions. It aims for both
efficiency and ease of use, trying to accurately mimic the behaviour of the
builtin floating point types at the best performance possible.


INSTALLATION AND REQUIREMENTS
-----------------------------

Conveniently, the library consists of just a single header file containing all
the functionality, which can be directly included by your projects, without the
necessity to build anything or link to anything.

Although this library is fully C++98-compatible, it can profit from certain
C++11 features. Support for those features is checked automatically at compile
(or rather preprocessing) time, but can be explicitly enabled or disabled by
defining the corresponding preprocessor symbols to either 1 or 0 yourself. This
is useful when the automatic detection fails (for more exotic implementations)
or when a feature should be explicitly disabled:

  - 'long long' integer type for mathematical functions returning 'long long'
    results (enabled for VC++ 2003 and newer, gcc and clang, overridable with
    'HALF_ENABLE_CPP11_LONG_LONG').

  - Static assertions for extended compile-time checks (enabled for VC++ 2010,
    gcc 4.3, clang 2.9 and newer, overridable with
    'HALF_ENABLE_CPP11_STATIC_ASSERT').

  - Generalized constant expressions (enabled for VC++ 2015, gcc 4.6, clang 3.1
    and newer, overridable with 'HALF_ENABLE_CPP11_CONSTEXPR').

  - noexcept exception specifications (enabled for VC++ 2015, gcc 4.6, clang 3.0
    and newer, overridable with 'HALF_ENABLE_CPP11_NOEXCEPT').

  - User-defined literals for half-precision literals to work (enabled for
    VC++ 2015, gcc 4.7, clang 3.1 and newer, overridable with
    'HALF_ENABLE_CPP11_USER_LITERALS').

  - Type traits and template meta-programming features from <type_traits>
    (enabled for VC++ 2010, libstdc++ 4.3, libc++ and newer, overridable with
    'HALF_ENABLE_CPP11_TYPE_TRAITS').

  - Special integer types from <cstdint> (enabled for VC++ 2010, libstdc++ 4.3,
    libc++ and newer, overridable with 'HALF_ENABLE_CPP11_CSTDINT').

  - Certain C++11 single-precision mathematical functions from <cmath> for
    an improved implementation of their half-precision counterparts to work
    (enabled for VC++ 2013, libstdc++ 4.3, libc++ and newer, overridable with
    'HALF_ENABLE_CPP11_CMATH').

  - Hash functor 'std::hash' from <functional> (enabled for VC++ 2010,
    libstdc++ 4.3, libc++ and newer, overridable with 'HALF_ENABLE_CPP11_HASH').
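As a minimal sketch of such an override, assume a toolchain on which the
detection of two of the features listed above goes wrong. Which features are
switched off here is purely illustrative; the symbol names and the header name
are the ones given in this document, and the definitions have to appear before
the header is included:

```cpp
// Force-disable two optional C++11 features (illustrative choice only);
// everything that is not defined here is still detected automatically.
#define HALF_ENABLE_CPP11_CONSTEXPR 0
#define HALF_ENABLE_CPP11_USER_LITERALS 0
#include <half.hpp>
```

Defining a symbol to 1 instead requests the feature even if it was not
detected.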

The library has been tested successfully with Visual C++ 2005-2015, gcc 4.4-4.8
and clang 3.1. Please contact me if you have any problems, suggestions or even
just success testing it on other platforms.


DOCUMENTATION
-------------

Here follow some general words about the usage of the library and its
implementation. For complete documentation of its interface, see the
corresponding website http://half.sourceforge.net. You may also generate the
complete developer documentation from the Doxygen comments in the library's
single header file, but this is more relevant to developers than to mere users
(for reasons described below).

BASIC USAGE

To make use of the library just include its only header file half.hpp, which
defines all half-precision functionality inside the 'half_float' namespace. The
actual 16-bit half-precision data type is represented by the 'half' type. This
type behaves like the builtin floating point types as much as possible,
supporting the usual arithmetic, comparison and streaming operators, which
makes its use pretty straightforward:

    using half_float::half;
    half a(3.4), b(5);
    half c = a * b;
    c += 3;
    if(c > a)
        std::cout << c << std::endl;

Additionally the 'half_float' namespace also defines half-precision versions
for all mathematical functions of the C++ standard library, which can be used
directly through ADL:

    half a(-3.14159);
    half s = sin(abs(a));
    long l = lround(s);

You may also specify explicit half-precision literals, since the library
provides a user-defined literal inside the 'half_float::literal' namespace,
which you just need to import (assuming support for C++11 user-defined literals):

    using namespace half_float::literal;
    half x = 1.0_h;

Furthermore the library provides proper specializations for
'std::numeric_limits', defining various implementation properties, and
'std::hash' for hashing half-precision numbers (assuming support for C++11
'std::hash'). Similar to the corresponding preprocessor symbols from <cmath>
the library also defines the 'HUGE_VALH' constant and possibly the
'FP_FAST_FMAH' symbol.

CONVERSIONS AND ROUNDING

The half is explicitly constructible/convertible from a single-precision float
argument. Thus it is also explicitly constructible/convertible from any type
implicitly convertible to float, but constructing it from types like double or
int will involve the usual warnings arising when implicitly converting those to
float because of the lost precision. On the one hand those warnings are
intentional, because converting those types to half necessarily also reduces
precision. On the other hand they are also raised for explicit conversions from
those types, when the user knows what they are doing. So if those warnings keep
bugging you, you will have to either convert explicitly to float first before
converting to half, or use the 'half_cast' described below. In addition you can
also directly assign float values to halfs.

In contrast to the float-to-half conversion, which reduces precision, the
conversion from half to float (and thus to any other type implicitly
convertible from float) is implicit, because all values representable with
half-precision are also representable with single-precision. This way the
half-to-float conversion behaves similarly to the builtin float-to-double
conversion and all arithmetic expressions involving both half-precision and
single-precision arguments will be of single-precision type. This way you can
also directly use the mathematical functions of the C++ standard library,
though in this case you will invoke the single-precision versions which will
also return single-precision values, which is (even if maybe performing the
exact same computation, see below) not as conceptually clean when working in a
half-precision environment.
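The conversion rules of the two paragraphs above can be sketched with a small
stand-in type. 'half_like' is a hypothetical name invented for this
illustration; it stores a plain float, does no 16-bit rounding and is not the
library's 'half'. It only mirrors which conversions are explicit and which are
implicit (C++11 is assumed for 'static_assert' and 'decltype'):

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for the library's 'half' type, used only to show the
// conversion rules; it stores a float and performs no 16-bit rounding at all.
class half_like {
public:
    half_like() : value_(0.0f) {}
    explicit half_like(float f) : value_(f) {}                   // float -> half: explicit
    half_like& operator=(float f) { value_ = f; return *this; }  // direct assignment from float
    operator float() const { return value_; }                    // half -> float: implicit
private:
    float value_;
};

// float converts to half_like only on request, half_like converts to float freely.
static_assert(!std::is_convertible<float, half_like>::value, "float -> half must be explicit");
static_assert(std::is_constructible<half_like, float>::value, "explicit construction is allowed");
static_assert(std::is_convertible<half_like, float>::value, "half -> float is implicit");

// Mixed arithmetic converts the half-like operand, so the result type is float.
static_assert(std::is_same<decltype(half_like(1.5f) + 2.0f), float>::value,
              "half + float yields float");
```

With such a type, 'half_like h = 1.0f;' does not compile, whereas
'half_like h(1.0f);', 'h = 2.0f;' and 'float f = h;' all do.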

The default rounding mode for conversions from float to half uses truncation
(round toward zero, but mapping overflows to infinity) for rounding values not
representable exactly in half-precision. This is the fastest rounding possible
and is usually sufficient. But by redefining the 'HALF_ROUND_STYLE'
preprocessor symbol (before including half.hpp) this default can be overridden
with one of the other standard rounding modes using their respective constants
or the equivalent values of 'std::float_round_style' (it can even be
synchronized with the underlying single-precision implementation by defining it
to 'std::numeric_limits<float>::round_style'):

  - 'std::round_indeterminate' or -1 for the fastest rounding (default).

  - 'std::round_toward_zero' or 0 for rounding toward zero.

  - 'std::round_to_nearest' or 1 for rounding to the nearest value.

  - 'std::round_toward_infinity' or 2 for rounding toward positive infinity.

  - 'std::round_toward_neg_infinity' or 3 for rounding toward negative infinity.
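To make the default behaviour described above more tangible, here is a rough
sketch of a truncating float-to-half conversion working on the raw bit
patterns. It is not the library's actual implementation; the function name
'float_to_half_truncate' is made up for this example, and an IEEE 754
single-precision 'float' is assumed:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit IEEE float assumed");

// Sketch of a float -> half conversion by truncation (round toward zero),
// with overflow mapped to infinity. Returns the 16-bit half pattern.
std::uint16_t float_to_half_truncate(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);              // read the float's bit pattern

    const std::uint32_t sign     = (bits >> 16) & 0x8000u; // sign moved to bit 15
    const std::uint32_t exponent = (bits >> 23) & 0xFFu;   // biased by 127
    const std::uint32_t mantissa = bits & 0x007FFFFFu;     // 23 stored bits

    std::uint32_t result;
    if (exponent == 0xFFu) {
        // Infinity stays infinity; NaNs get a mantissa bit forced so they stay NaNs.
        result = 0x7C00u | (mantissa ? 0x0200u : 0u);
    } else {
        const int e = static_cast<int>(exponent) - 127 + 15;  // rebias for half
        if (e >= 31)                                           // too large: infinity
            result = 0x7C00u;
        else if (e >= 1)                                       // normal: drop low 13 bits
            result = (static_cast<std::uint32_t>(e) << 10) | (mantissa >> 13);
        else if (e >= -9)                                      // subnormal half
            result = (mantissa | 0x00800000u) >> (14 - e);     // hidden bit made explicit
        else                                                   // too small: signed zero
            result = 0u;
    }
    return static_cast<std::uint16_t>(sign | result);
}
```

In this sketch 1.9999f becomes 0x3FFF (1.9990234375), although 0x4000 (2.0) is
closer; that lost accuracy is the price of the cheapest rounding.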

In addition to changing the overall default rounding mode one can also use the
'half_cast'. This converts between half and any built-in arithmetic type using
a configurable rounding mode (or the default rounding mode if none is
specified). Besides the configurable rounding mode, 'half_cast' differs from a
mere 'static_cast' in another important way: all conversions are performed
directly using the given rounding mode, without any intermediate conversion
to/from 'float'. This is especially relevant for conversions to integer types,
which don't necessarily truncate anymore. But also for conversions from
'double' or 'long double' this may produce more precise results than a
pre-conversion to 'float' using the single-precision implementation's current
rounding mode would.

    half a = half_cast<half>(4.2);
    half b = half_cast<half,std::numeric_limits<float>::round_style>(4.2f);
    assert( half_cast<int, std::round_to_nearest>( 0.7_h ) == 1 );
    assert( half_cast<half,std::round_toward_zero>( 4097 ) == 4096.0_h );
    assert( half_cast<half,std::round_toward_infinity>( 4097 ) == 4100.0_h );
    assert( half_cast<half,std::round_toward_infinity>( std::numeric_limits<double>::min() ) > 0.0_h );
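Why a detour through 'float' can cost precision is easiest to see with a value
that lies just above a half-precision tie. The following is a simplified model
under stated assumptions, not a call into the library: the helper names are
invented for this example, round-to-nearest with ties to even is assumed for
both steps, and the limited exponent range of half is ignored.

```cpp
#include <cassert>
#include <cmath>

// Round x to 'bits' significant binary digits, ties to even (std::nearbyint
// follows the current rounding mode, which is round-to-nearest-even by default).
// Simplified model: the limited exponent range of half is ignored.
double round_to_significand(double x, int bits) {
    if (x == 0.0 || !std::isfinite(x)) return x;
    int exp;
    const double frac = std::frexp(x, &exp);   // x = frac * 2^exp, 0.5 <= |frac| < 1
    return std::ldexp(std::nearbyint(std::ldexp(frac, bits)), exp - bits);
}

const int half_bits = 11;                      // 10 stored bits plus the hidden bit

// Convert directly to half precision ...
double to_half_direct(double x) { return round_to_significand(x, half_bits); }

// ... or round to float first and only then to half precision.
double to_half_via_float(double x) {
    return round_to_significand(static_cast<float>(x), half_bits);
}
```

For x = 1 + 2^-11 + 2^-30 the direct conversion yields 1 + 2^-10, the nearest
half-precision value. The detour first rounds x to the float 1 + 2^-11, which
is an exact tie in half precision and is then resolved to 1.0.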

When using round to nearest (either as default or through 'half_cast') ties are
by default resolved by rounding them away from zero (which matches the
behaviour of the 'round' function). But by redefining the
'HALF_ROUND_TIES_TO_EVEN' preprocessor symbol to 1 (before including half.hpp)
this default can be changed to the slightly slower but less biased and more
IEEE-conformant behaviour of rounding half-way cases to the nearest even value.

    #define HALF_ROUND_TIES_TO_EVEN 1
    #include <half.hpp>
    ...
    assert( half_cast<int,std::round_to_nearest>(3.5_h)
         == half_cast<int,std::round_to_nearest>(4.5_h) );
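The two tie-breaking rules can be compared with standard library functions
alone. This small sketch does not use the library; the two helper names are
invented for the example, and the default floating point rounding mode
(FE_TONEAREST) is assumed for 'std::nearbyint':

```cpp
#include <cassert>
#include <cmath>

// Ties away from zero: the behaviour of std::round / std::lround.
long nearest_ties_away(double x) { return std::lround(x); }

// Ties to even: std::nearbyint under the default FE_TONEAREST rounding mode.
long nearest_ties_even(double x) { return static_cast<long>(std::nearbyint(x)); }
```

With ties away from zero, 3.5 and 4.5 round to 4 and 5 respectively; with ties
to even both round to 4, which is the equality the assertion above relies on.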

IMPLEMENTATION

For performance reasons (and ease of implementation) many of the mathematical
functions provided by the library as well as all arithmetic operations are
actually carried out in single-precision under the hood, calling to the C++
standard library implementations of those functions whenever appropriate,
meaning the arguments are converted to floats and the result back to half. But
to reduce the conversion overhead as much as possible any temporary values
inside of lengthy expressions are kept in single-precision as long as possible,
while still maintaining a strong half-precision type to the outside world. Only
when finally assigning the value to a half or calling a function that works
directly on halfs is the actual conversion done (or never, when further
converting the result to float).

This approach has two implications. First of all you have to treat the
library's documentation at http://half.sourceforge.net as a simplified version,
describing the behaviour of the library as if all functions and operators
worked on plain half values. The actual argument and return types of functions
and operators may involve other internal types (feel free to generate the exact
developer documentation from the Doxygen comments in the library's header file
if you really need to). Nevertheless the behaviour is exactly as specified in
the documentation. The other implication is that in the presence of rounding
errors or over-/underflows arithmetic expressions may produce different results
when compared to converting to half-precision after each individual operation:

    half a = std::numeric_limits<half>::max() * 2.0_h / 2.0_h;       // a = MAX
    half b = half(std::numeric_limits<half>::max() * 2.0_h) / 2.0_h; // b = INF
    assert( a != b );
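The same effect can be reproduced without the library by modelling the
half-precision store with a plain function. This is a deliberately crude
sketch: 'store_as_half', 'fused' and 'stepwise' are names made up for the
illustration, and only the overflow behaviour of the default truncating
conversion is modelled, not the loss of precision.

```cpp
#include <cassert>
#include <limits>

const float half_max = 65504.0f;   // largest finite half-precision value

// Crude model of storing a float into a half: only overflow is modelled
// (values beyond the half range become infinity); precision loss is ignored.
float store_as_half(float x) {
    const float inf = std::numeric_limits<float>::infinity();
    if (x >= 65536.0f)  return inf;
    if (x <= -65536.0f) return -inf;
    return x;
}

// Whole expression kept in single precision, converted once at the end.
float fused() { return store_as_half(half_max * 2.0f / 2.0f); }

// Converted to half after every single operation.
float stepwise() { return store_as_half(store_as_half(half_max * 2.0f) / 2.0f); }
```

Here 'fused()' returns 65504, because the intermediate 131008 fits easily into
a float, while 'stepwise()' returns infinity, because the intermediate already
overflows when it is stored as a half.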

Such differences in intermediate results should only be a problem in very few
cases, though.

One last word has to be said about performance. Even with its efforts in
reducing conversion overhead as much as possible, the software half-precision
implementation can most probably not beat the direct use of single-precision
computations. Usually using actual float values for all computations and
temporaries and using halfs only for storage is the recommended way. This
admittedly makes the provided mathematical functions somewhat redundant
(especially in light of the implicit conversion from half to float), but the
goal of this library was to provide a complete and conceptually clean
half-precision implementation, to which the standard mathematical functions
belong, even if usually not needed.

IEEE CONFORMANCE

The half type uses the standard IEEE representation with 1 sign bit, 5 exponent
bits and 10 mantissa bits (11 when counting the hidden bit). It supports all
types of special values, like subnormal values, infinity and NaNs. But there
are some limitations to the complete conformance to the IEEE 754 standard:

  - The implementation does not differentiate between signalling and quiet
    NaNs; this means operations on halfs are not specified to trap on
    signalling NaNs (though they may, see last point).

  - Though arithmetic operations are internally rounded to single-precision
    using the underlying single-precision implementation's current rounding
    mode, those values are then converted to half-precision using the default
    half-precision rounding mode (changed by defining 'HALF_ROUND_STYLE'
    accordingly). This mixture of rounding modes is also the reason why
    'std::numeric_limits<half>::round_style' may actually return
    'std::round_indeterminate' when half- and single-precision rounding modes
    don't match.

  - Because of internal truncation it may also be that certain single-precision
    NaNs will be wrongly converted to half-precision infinity, though this is
    very unlikely to happen, since most single-precision implementations don't
    tend to only set the lowest bits of a NaN mantissa.

  - The implementation does not provide any floating point exceptions, thus
    arithmetic operations or mathematical functions are not specified to invoke
    proper floating point exceptions. But because many functions are
    implemented in single-precision, those may still invoke floating point
    exceptions of the underlying single-precision implementation.
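The third point is easy to reproduce on raw bit patterns. The sketch below
assumes that narrowing simply drops the lowest 13 of the 23 single-precision
mantissa bits; whether the library narrows special values in exactly this way
is an assumption, and all three function names are invented for the example:

```cpp
#include <cassert>
#include <cstdint>

// Naive narrowing of an infinity/NaN bit pattern: keep the sign, set the
// all-ones half exponent and keep only the top 10 of the 23 mantissa bits.
// Illustrative only; assumes 'float_bits' already has the exponent 0xFF.
std::uint16_t narrow_special(std::uint32_t float_bits) {
    const std::uint32_t sign     = (float_bits >> 16) & 0x8000u;
    const std::uint32_t mantissa = (float_bits & 0x007FFFFFu) >> 13;  // low 13 bits are lost
    return static_cast<std::uint16_t>(sign | 0x7C00u | mantissa);
}

// A half is a NaN when its exponent is all ones and its mantissa is non-zero,
// and an infinity when its exponent is all ones and its mantissa is zero.
bool half_is_nan(std::uint16_t h) { return (h & 0x7C00u) == 0x7C00u && (h & 0x03FFu) != 0; }
bool half_is_inf(std::uint16_t h) { return (h & 0x7FFFu) == 0x7C00u; }
```

The common quiet NaN pattern 0x7FC00000 survives as the half NaN 0x7E00,
whereas a NaN such as 0x7F800001, which only has its lowest mantissa bit set,
collapses to 0x7C00, the pattern of positive infinity.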

Some of those points could have been circumvented by controlling the floating
point environment using <cfenv> or implementing a similar exception mechanism.
But this would have required excessive runtime checks, with too high an impact
on performance for something that is rarely ever needed. If you really need to
rely on proper floating point exceptions, it is recommended to explicitly
perform computations using the built-in floating point types to be on the safe
side. In the same way, if you really need to rely on a particular rounding
behaviour, it is recommended to either use single-precision computations and
explicitly convert the result to half-precision using 'half_cast' and
specifying the desired rounding mode, or synchronize the default half-precision
rounding mode to the rounding mode of the single-precision implementation (most
likely 'HALF_ROUND_STYLE=1', 'HALF_ROUND_TIES_TO_EVEN=1'). But this is really
considered an expert scenario that should be used only when necessary, since
actually working with half-precision usually comes with a certain
tolerance/ignorance of exactness considerations and proper rounding comes with
a certain performance cost.
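In source form, the synchronization mentioned above would look roughly as
follows, assuming the single-precision implementation rounds to nearest with
ties to even (the "most likely" case named in the text). Both symbols have to
be defined before the header is included:

```cpp
// Round to nearest, ties to even, to match the usual single-precision mode.
#define HALF_ROUND_STYLE 1
#define HALF_ROUND_TIES_TO_EVEN 1
#include <half.hpp>
```

The same effect can be had by passing the two definitions on the compiler
command line instead of writing them into the source.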


CREDITS AND CONTACT
-------------------

This library is developed by CHRISTIAN RAU and released under the MIT License
(see LICENSE.txt). If you have any questions or problems with it, feel free to
contact me at rauy@users.sourceforge.net.

Additional credit goes to JEROEN VAN DER ZIJP for his paper on "Fast Half Float
Conversions", whose algorithms have been used in the library for converting
between half-precision and single-precision values.