HALF-PRECISION FLOATING-POINT LIBRARY (Version 2.2.0)
-----------------------------------------------------

This is a C++ header-only library to provide an IEEE 754 conformant 16-bit
half-precision floating-point type along with corresponding arithmetic
operators, type conversions and common mathematical functions. It aims for both
efficiency and ease of use, trying to accurately mimic the behaviour of the
built-in floating-point types at the best performance possible.


INSTALLATION AND REQUIREMENTS
-----------------------------

Conveniently, the library consists of just a single header file containing all
the functionality, which can be directly included by your projects, without the
need to build or link anything.

Although this library is fully C++98-compatible, it can profit from certain
C++11 features. Support for those features is checked automatically at compile
(or rather preprocessing) time, but can be explicitly enabled or disabled by
predefining the corresponding preprocessor symbols to either 1 or 0 yourself
before including half.hpp. This is useful when the automatic detection fails
(for more exotic implementations) or when a feature should be explicitly
disabled:

  - 'long long' integer type for mathematical functions returning 'long long'
    results (enabled for VC++ 2003 and icc 11.1 and newer, gcc and clang,
    overridable with 'HALF_ENABLE_CPP11_LONG_LONG').

  - Static assertions for extended compile-time checks (enabled for VC++ 2010,
    gcc 4.3, clang 2.9, icc 11.1 and newer, overridable with
    'HALF_ENABLE_CPP11_STATIC_ASSERT').

  - Generalized constant expressions (enabled for VC++ 2015, gcc 4.6, clang 3.1,
    icc 14.0 and newer, overridable with 'HALF_ENABLE_CPP11_CONSTEXPR').

  - noexcept exception specifications (enabled for VC++ 2015, gcc 4.6,
    clang 3.0, icc 14.0 and newer, overridable with 'HALF_ENABLE_CPP11_NOEXCEPT').

  - User-defined literals for half-precision literals to work (enabled for
    VC++ 2015, gcc 4.7, clang 3.1, icc 15.0 and newer, overridable with
    'HALF_ENABLE_CPP11_USER_LITERALS').

  - Thread-local storage for per-thread floating-point exception flags (enabled
    for VC++ 2015, gcc 4.8, clang 3.3, icc 15.0 and newer, overridable with
    'HALF_ENABLE_CPP11_THREAD_LOCAL').

  - Type traits and template meta-programming features from <type_traits>
    (enabled for VC++ 2010, libstdc++ 4.3, libc++ and newer, overridable with
    'HALF_ENABLE_CPP11_TYPE_TRAITS').

  - Special integer types from <cstdint> (enabled for VC++ 2010, libstdc++ 4.3,
    libc++ and newer, overridable with 'HALF_ENABLE_CPP11_CSTDINT').

  - Certain C++11 single-precision mathematical functions from <cmath> for
    floating-point classification during conversions from higher precision types
    (enabled for VC++ 2013, libstdc++ 4.3, libc++ and newer, overridable with
    'HALF_ENABLE_CPP11_CMATH').

  - Floating-point environment control from <cfenv> for possible exception
    propagation to the built-in floating-point platform (enabled for VC++ 2013,
    libstdc++ 4.3, libc++ and newer, overridable with 'HALF_ENABLE_CPP11_CFENV').

  - Hash functor 'std::hash' from <functional> (enabled for VC++ 2010,
    libstdc++ 4.3, libc++ and newer, overridable with 'HALF_ENABLE_CPP11_HASH').

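As a minimal sketch of such an override, the following fragment forces two of
the switches from the list before the header is included. The symbol names are
the ones listed above; the include path "half.hpp" is only an assumption about
your project layout:

```cpp
// Turn off constexpr support even if the compiler would be detected as capable.
#define HALF_ENABLE_CPP11_CONSTEXPR 0
// Turn on 'long long' support for a compiler the automatic detection misses.
#define HALF_ENABLE_CPP11_LONG_LONG 1

// The symbols have to be defined before the first inclusion of the header.
#include "half.hpp"
```

Whichever way the symbols are defined, every translation unit that includes the
header should see the same values.
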
The library has been tested successfully with Visual C++ 2005-2015, gcc 4-8
and clang 3-8 on 32- and 64-bit x86 systems. Please contact me if you have any
problems, suggestions or even just success testing it on other platforms.


DOCUMENTATION
-------------

What follows are some general words about the usage of the library and its
implementation. For complete documentation of its interface consult the
corresponding website http://half.sourceforge.net. You may also generate the
complete developer documentation from the doxygen comments in the library's
only include file, but this is more relevant to developers than to mere users.

BASIC USAGE

To make use of the library just include its only header file half.hpp, which
defines all half-precision functionality inside the 'half_float' namespace. The
actual 16-bit half-precision data type is represented by the 'half' type, which
uses the standard IEEE representation with 1 sign bit, 5 exponent bits and 11
mantissa bits (including the hidden bit) and supports all types of special
values, like subnormal values, infinity and NaNs. This type behaves like the
built-in floating-point types as much as possible, supporting the usual
arithmetic, comparison and streaming operators, which makes its use pretty
straightforward:

    using half_float::half;
    half a(3.4), b(5);
    half c = a * b;
    c += 3;
    if(c > a)
        std::cout << c << std::endl;

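To make the bit layout mentioned above more concrete, here is a small
self-contained sketch that decodes a raw 16-bit pattern by hand. It does not
use the library at all; 'decode_half' and 'decode_half_examples' are
hypothetical names chosen for this illustration:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Decode an IEEE 754 half-precision bit pattern into a float. The result is
// always exact, since every half value is representable in single precision.
inline float decode_half(std::uint16_t bits)
{
    const int sign     = (bits >> 15) & 0x1;   // 1 sign bit
    const int exponent = (bits >> 10) & 0x1F;  // 5 exponent bits, bias 15
    const int mantissa = bits & 0x3FF;         // 10 stored mantissa bits

    float magnitude;
    if(exponent == 0x1F)    // all ones: infinity (mantissa 0) or NaN
        magnitude = mantissa ? std::numeric_limits<float>::quiet_NaN()
                             : std::numeric_limits<float>::infinity();
    else if(exponent == 0)  // zero or subnormal: no hidden bit, scale 2^-24
        magnitude = std::ldexp(static_cast<float>(mantissa), -24);
    else                    // normal: the hidden bit gives 11 significant bits
        magnitude = std::ldexp(static_cast<float>(mantissa | 0x400), exponent - 25);
    return sign ? -magnitude : magnitude;
}

// A few well-known bit patterns as a sanity check.
inline void decode_half_examples()
{
    assert(decode_half(0x3C00) == 1.0f);                   // exponent 15, mantissa 0
    assert(decode_half(0xC000) == -2.0f);                  // sign set, exponent 16
    assert(decode_half(0x7BFF) == 65504.0f);               // largest finite value
    assert(decode_half(0x0001) == std::ldexp(1.0f, -24));  // smallest subnormal
    assert(std::isinf(decode_half(0x7C00)));               // positive infinity
    assert(std::isnan(decode_half(0x7E00)));               // a NaN pattern
}
```

Calling 'decode_half_examples()' from a test program runs the checks.
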
The 'half_float' namespace also defines half-precision versions for all
mathematical functions of the C++ standard library, which can be used directly
through ADL:

    half a(-3.14159);
    half s = sin(abs(a));
    long l = lround(s);

You may also specify explicit half-precision literals, since the library
provides a user-defined literal inside the 'half_float::literal' namespace,
which you just need to import (assuming support for C++11 user-defined literals):

    using namespace half_float::literal;
    half x = 1.0_h;

Furthermore the library provides proper specializations for
'std::numeric_limits', defining various implementation properties, and
'std::hash' for hashing half-precision numbers (assuming support for C++11
'std::hash'). Similar to the corresponding preprocessor symbols from <cmath>
the library also defines the 'HUGE_VALH' constant and possibly the
'FP_FAST_FMAH' symbol.

CONVERSIONS AND ROUNDING

The half is explicitly constructible/convertible from a single-precision float
argument. Thus it is also explicitly constructible/convertible from any type
implicitly convertible to float, but constructing it from types like double or
int will involve the usual warnings arising when implicitly converting those to
float because of the lost precision. On the one hand those warnings are
intentional, because converting those types to half necessarily also reduces
precision. But on the other hand they are raised for explicit conversions from
those types, when the user knows what they are doing. So if those warnings keep
bugging you, you will have to either convert to float explicitly before
converting to half, or use the 'half_cast' described below. In addition
you can also directly assign float values to halfs.

In contrast to the float-to-half conversion, which reduces precision, the
conversion from half to float (and thus to any other type implicitly
convertible from float) is implicit, because all values representable with
half-precision are also representable with single-precision. This way the
half-to-float conversion behaves similarly to the built-in float-to-double
conversion and all arithmetic expressions involving both half-precision and
single-precision arguments will be of single-precision type. This way you can
also directly use the mathematical functions of the C++ standard library,
though in this case you will invoke the single-precision versions which will
also return single-precision values, which is (even if maybe performing the
exact same computation, see below) not as conceptually clean when working in a
half-precision environment.

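The conversion rules from the two paragraphs above can be illustrated with a
toy type. This is only a sketch: 'narrow' is a hypothetical stand-in that
stores a plain float and mirrors just the explicit/implicit split, it is not
the library's 'half' class. The checks assume C++11 for <type_traits>:

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Toy stand-in that only mirrors the conversion rules described above.
struct narrow
{
    explicit narrow(float f) : value(f) {}                   // from float: explicit only
    narrow &operator=(float f) { value = f; return *this; }  // floats can be assigned directly
    operator float() const { return value; }                 // to float: implicit
    float value;
};

// Narrowing has to be spelled out, also for types that convert to float.
static_assert(!std::is_convertible<float, narrow>::value, "no implicit float -> narrow");
static_assert(std::is_constructible<narrow, double>::value, "explicit construction from double");

// Widening is implicit, to float and to anything reachable from float.
static_assert(std::is_convertible<narrow, float>::value, "implicit narrow -> float");
static_assert(std::is_convertible<narrow, double>::value, "implicit narrow -> double");

// Mixed arithmetic is carried out in, and yields, single precision.
static_assert(std::is_same<decltype(std::declval<narrow>() + 1.0f), float>::value,
              "narrow + float is float");

inline void narrow_examples()
{
    narrow n(1.5f);            // explicit construction
    n = 2.5f;                  // direct assignment of a float
    float f = n;               // implicit conversion back to float
    assert(f == 2.5f);
    assert(n + 1.0f == 3.5f);  // evaluated as float + float
}
```
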
The default rounding mode for conversions between half and more precise types
as well as for rounding results of arithmetic operations and mathematical
functions rounds to the nearest representable value. But by predefining the
'HALF_ROUND_STYLE' preprocessor symbol this default can be overridden with one
of the other standard rounding modes using their respective constants or the
equivalent values of 'std::float_round_style' (it can even be synchronized with
the built-in single-precision implementation by defining it to
'std::numeric_limits<float>::round_style'):

  - 'std::round_indeterminate' (-1) for the fastest rounding.

  - 'std::round_toward_zero' (0) for rounding toward zero.

  - 'std::round_to_nearest' (1) for rounding to the nearest value (default).

  - 'std::round_toward_infinity' (2) for rounding toward positive infinity.

  - 'std::round_toward_neg_infinity' (3) for rounding toward negative infinity.

In addition to changing the overall default rounding mode one can also use the
'half_cast'. This converts between half and any built-in arithmetic type using
a configurable rounding mode (or the default rounding mode if none is
specified). Apart from the configurable rounding mode, 'half_cast' differs from
a mere 'static_cast' in another important way: Any conversions are performed
directly using the given rounding mode, without any intermediate conversion
to/from 'float'. This is especially relevant for conversions to integer types,
which don't necessarily truncate anymore. But also for conversions from
'double' or 'long double' this may produce more precise results than a
pre-conversion to 'float' using the single-precision implementation's current
rounding mode would.

    half a = half_cast<half>(4.2);
    half b = half_cast<half,std::numeric_limits<float>::round_style>(4.2f);
    assert( half_cast<int, std::round_to_nearest>( 0.7_h ) == 1 );
    assert( half_cast<half,std::round_toward_zero>( 4097 ) == 4096.0_h );
    assert( half_cast<half,std::round_toward_infinity>( 4097 ) == 4100.0_h );
    assert( half_cast<half,std::round_toward_infinity>( std::numeric_limits<double>::min() ) > 0.0_h );

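Why 4097 becomes 4096 or 4100 in the assertions above follows from the 11
significant bits of a half: between 4096 and 8192 only every fourth integer is
representable. The following self-contained sketch reproduces this with plain
integers. It does not use the library; 'round_to_half_precision' is a
hypothetical helper, and both the ties-to-even rule for rounding to nearest and
the truncation for 'std::round_indeterminate' are assumptions of this sketch:

```cpp
#include <cassert>
#include <limits>

// Round a non-negative integer to 11 significant bits, i.e. to the nearest
// values a half can hold, using the given rounding style.
inline unsigned round_to_half_precision(unsigned value, std::float_round_style style)
{
    assert(value <= 65504u);        // largest finite half, keeps the sketch overflow-free
    unsigned step = 1;              // spacing between neighbouring halfs near 'value'
    while(value / step >= 2048u)    // more than 11 significant bits remain
        step *= 2;
    const unsigned down = value / step * step;  // representable value at or below 'value'
    const unsigned rest = value - down;         // discarded part, 0 <= rest < step
    if(rest == 0)
        return value;               // exactly representable, nothing to round

    bool up = false;
    switch(style)
    {
        case std::round_toward_infinity:        // positive values always round up
            up = true;
            break;
        case std::round_to_nearest:             // ties go to the even mantissa (assumption)
            up = 2 * rest > step || (2 * rest == step && (down / step) % 2 != 0);
            break;
        default:                                // other styles truncate for positive values
            break;
    }
    return up ? down + step : down;
}

inline void rounding_examples()
{
    assert(round_to_half_precision(4097, std::round_toward_zero) == 4096);
    assert(round_to_half_precision(4097, std::round_toward_infinity) == 4100);
    assert(round_to_half_precision(4097, std::round_to_nearest) == 4096);
    assert(round_to_half_precision(4098, std::round_to_nearest) == 4096);  // tie, even mantissa
    assert(round_to_half_precision(4102, std::round_to_nearest) == 4104);  // tie, even mantissa
}
```
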
ACCURACY AND PERFORMANCE

From version 2.0 onward the library is implemented without employing the
underlying floating-point implementation of the system (except for conversions,
of course), providing an entirely self-contained half-precision implementation
with results independent from the system's existing single- or double-precision
implementation and its rounding behaviour.

As to accuracy, many of the operators and functions provided by this library
are exact to rounding for all rounding modes, i.e. the error to the exact
result is at most 0.5 ULP (unit in the last place) for rounding to nearest and
less than 1 ULP for all other rounding modes. This holds for all the operations
required by the IEEE 754 standard and many more. Specifically the following
functions might exhibit a deviation from the correctly rounded exact result by
1 ULP for a select few input values: 'expm1', 'log1p', 'pow', 'atan2', 'erf',
'erfc', 'lgamma', 'tgamma' (for more details see the documentation of the
individual functions). All other functions and operators are always exact to
rounding or independent of the rounding mode altogether.

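To give the ULP bounds above a concrete scale, the following self-contained
sketch computes the size of one half-precision ULP and expresses a rounding
error in ULPs. It does not use the library; 'half_ulp' and 'error_in_ulps' are
hypothetical helpers, and the ULP is taken here as the spacing of half values
at the magnitude of the exact result, which is an assumption of this sketch:

```cpp
#include <cassert>
#include <cmath>

// Size of one half-precision ULP at the magnitude of 'x' (finite, |x| <= 65504).
inline double half_ulp(double x)
{
    int exponent = -14;                     // smallest normal exponent, shared by subnormals
    if(x != 0.0)
    {
        const int e = std::ilogb(x);        // floor(log2(|x|))
        if(e > exponent)
            exponent = e;
    }
    return std::ldexp(1.0, exponent - 10);  // 10 stored mantissa bits
}

// Distance between an exact result and its rounded value, measured in ULPs.
inline double error_in_ulps(double exact, double rounded)
{
    return std::fabs(rounded - exact) / half_ulp(exact);
}

inline void ulp_examples()
{
    assert(half_ulp(1.0) == std::ldexp(1.0, -10));   // spacing just above 1
    assert(half_ulp(4097.0) == 4.0);                 // spacing between 4096 and 8192
    assert(half_ulp(0.0) == std::ldexp(1.0, -24));   // smallest subnormal
    assert(error_in_ulps(4097.0, 4096.0) == 0.25);   // rounded to nearest: at most 0.5 ULP
    assert(error_in_ulps(4097.0, 4100.0) == 0.75);   // rounded upward: less than 1 ULP
}
```
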
The increased IEEE-conformance and cleanliness of this implementation comes
with a certain performance cost compared to doing computations and mathematical
functions in hardware-accelerated single-precision. On average and depending on
the platform, the arithmetic operators are about 75% as fast and the
mathematical functions about 33-50% as fast as performing the corresponding
operations in single-precision and converting between the inputs and outputs.
However, directly computing with half-precision values is a rather rare
use-case and usually using actual 'float' values for all computations and
temporaries and using 'half's only for storage is the recommended way. But
nevertheless the goal of this library was to provide a complete and
conceptually clean IEEE-conformant half-precision implementation and in the few
cases when you do need to compute directly in half-precision you do so for a
reason and want accurate results.

If necessary, this internal implementation can be overridden by predefining the
'HALF_ARITHMETIC_TYPE' preprocessor symbol to one of the built-in
floating-point types ('float', 'double' or 'long double'), which will cause the
library to use this type for computing arithmetic operations and mathematical
functions (if available). However, due to using the platform's floating-point
implementation (and its rounding behaviour) internally, this might cause
results to deviate from the specified half-precision rounding mode. It will of
course also inhibit the automatic exception detection described below.

The conversion operations between half-precision and single-precision types can
also make use of the F16C extension for x86 processors by using the
corresponding compiler intrinsics from <immintrin.h>. Support for this is
checked at compile-time by looking for the '__F16C__' macro which at least gcc
and clang define based on the target platform. It can also be enabled manually
by predefining the 'HALF_ENABLE_F16C_INTRINSICS' preprocessor symbol to 1, or 0
for explicitly disabling it. However, this will directly use the corresponding
intrinsics for conversion without checking if they are available at runtime
(possibly crashing if they are not), so make sure they are supported on the
target platform before enabling this.

EXCEPTION HANDLING

The half-precision implementation supports all 5 required floating-point
exceptions from the IEEE standard to indicate erroneous inputs or inexact
results during operations. These are represented by exception flags which
actually use the same values as the corresponding 'FE_...' flags defined in
C++11's <cfenv> header if supported, specifically:

  - 'FE_INVALID' for invalid inputs to an operation.
  - 'FE_DIVBYZERO' for finite inputs producing infinite results.
  - 'FE_OVERFLOW' if a result is too large to represent finitely.
  - 'FE_UNDERFLOW' for a subnormal or zero result after rounding.
  - 'FE_INEXACT' if a result needed rounding to be representable.
  - 'FE_ALL_EXCEPT' as a convenient OR of all possible exception flags.

The internal exception flag state will start with all flags cleared and is
maintained per thread if C++11 thread-local storage is supported, otherwise it
will be maintained globally and will theoretically NOT be thread-safe (while
practically being as thread-safe as a simple integer variable can be). These
flags can be managed explicitly using the library's error handling functions,
which again try to mimic the built-in functions for handling floating-point
exceptions from <cfenv>. You can clear them with 'feclearexcept' (which is the
only way a flag can be cleared), test them with 'fetestexcept', explicitly
raise errors with 'feraiseexcept' and save and restore their state using
'fegetexceptflag' and 'fesetexceptflag'. You can also throw corresponding C++
exceptions based on the current flag state using 'fethrowexcept'.

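For readers unfamiliar with the built-in functions that the library's error
handling functions are modelled on, here is a small sketch of the clear, raise,
test, save and restore cycle using the standard <cfenv> versions. Note that
this operates on the platform's floating-point flags, not on the library's
internal state, and that the exact signatures of the 'half_float' counterparts
should be taken from the library's documentation rather than from this sketch.
'flag_state_roundtrip' is a hypothetical name:

```cpp
#include <cassert>
#include <cfenv>

// Walk through the life cycle of a single exception flag using <cfenv>.
inline bool flag_state_roundtrip()
{
    std::feclearexcept(FE_ALL_EXCEPT);                          // start from a clean state
    const bool clean = std::fetestexcept(FE_ALL_EXCEPT) == 0;

    std::feraiseexcept(FE_OVERFLOW);                            // raise a flag explicitly
    const bool raised = std::fetestexcept(FE_OVERFLOW) != 0;

    std::fexcept_t saved;
    std::fegetexceptflag(&saved, FE_OVERFLOW);                  // save the flag's state

    std::feclearexcept(FE_OVERFLOW);                            // flags stay set until cleared
    const bool cleared = std::fetestexcept(FE_OVERFLOW) == 0;

    std::fesetexceptflag(&saved, FE_OVERFLOW);                  // restore the saved state
    const bool restored = std::fetestexcept(FE_OVERFLOW) != 0;

    std::feclearexcept(FE_ALL_EXCEPT);                          // leave a clean state behind
    return clean && raised && cleared && restored;
}
```
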
However, any automatic exception detection and handling during half-precision
operations and functions is DISABLED by default, since it comes with a minor
performance overhead due to runtime checks, and reacting to IEEE floating-point
exceptions is rarely ever needed in application code. But the library fully
supports IEEE-conformant detection of floating-point exceptions and various
ways for handling them, which can be enabled by pre-defining the corresponding
preprocessor symbols to 1. They can be enabled individually or all at once and
they will be processed in the order they are listed here:

  - 'HALF_ERRHANDLING_FLAGS' sets the internal exception flags described above
    whenever the corresponding exception occurs.
  - 'HALF_ERRHANDLING_ERRNO' sets the value of 'errno' from <cerrno> similar to
    the behaviour of the built-in floating-point types when 'MATH_ERRNO' is used.
  - 'HALF_ERRHANDLING_FENV' will propagate exceptions to the built-in
    floating-point implementation using 'std::feraiseexcept' if support for
    C++11 floating-point control is enabled. However, this does not synchronize
    exceptions: neither will clearing propagate nor will it work in reverse.
  - 'HALF_ERRHANDLING_THROW_...' can be defined to a string literal which will
    be used as description message for a C++ exception that is thrown whenever
    a 'FE_...' exception occurs, similar to the behaviour of 'fethrowexcept'.

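A minimal configuration sketch, using only symbols from the list above: it
enables the internal flags and the 'errno' reporting before the header is
included. The include path "half.hpp" is only an assumption about your project
layout:

```cpp
// Record exceptions in the library's internal exception flags ...
#define HALF_ERRHANDLING_FLAGS 1
// ... and additionally report them through 'errno'.
#define HALF_ERRHANDLING_ERRNO 1

// The symbols have to be defined before the first inclusion of the header.
#include "half.hpp"
```

With this in place the internal flags can be inspected and reset through the
library's 'fetestexcept' and 'feclearexcept' functions.
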
If any of the above error handling is activated, non-quiet operations on
half-precision values will also raise a 'FE_INVALID' exception whenever
they encounter a signaling NaN value, in addition to transforming the value
into a quiet NaN. If error handling is disabled, signaling NaNs will be
treated like quiet NaNs (while still getting explicitly quieted if propagated
to the result). There can also be additional treatment of overflow and
underflow errors after they have been processed as above, which is ENABLED by
default (but of course only takes effect if any other exception handling is
activated) unless overridden by pre-defining the corresponding preprocessor
symbol to 0:

  - 'HALF_ERRHANDLING_OVERFLOW_TO_INEXACT' will cause overflow errors to also
    raise a 'FE_INEXACT' exception.
  - 'HALF_ERRHANDLING_UNDERFLOW_TO_INEXACT' will cause underflow errors to also
    raise a 'FE_INEXACT' exception. This will also slightly change the
    behaviour of the underflow exception, which will ONLY be raised if the
    result is actually inexact due to underflow. If this is disabled, underflow
    exceptions will be raised for ANY (possibly exact) subnormal result.

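The difference between signaling and quiet NaNs mentioned above is a single
bit in the 16-bit pattern. The following self-contained sketch shows the
distinction and the quieting step on raw bits. It does not use the library, the
helper names are hypothetical, and it assumes the usual IEEE 754-2008 encoding
in which the most significant stored mantissa bit marks a quiet NaN:

```cpp
#include <cassert>
#include <cstdint>

// NaN: exponent bits all ones and a non-zero mantissa.
inline bool is_nan_bits(std::uint16_t bits)
{
    return (bits & 0x7FFF) > 0x7C00;
}

// Signaling NaN: a NaN whose quiet bit (0x0200) is not set.
inline bool is_signaling_bits(std::uint16_t bits)
{
    return is_nan_bits(bits) && (bits & 0x0200) == 0;
}

// Quieting keeps sign and payload and only sets the quiet bit; all other
// values pass through unchanged.
inline std::uint16_t quieted(std::uint16_t bits)
{
    return is_nan_bits(bits) ? static_cast<std::uint16_t>(bits | 0x0200) : bits;
}

inline void nan_examples()
{
    assert(!is_nan_bits(0x7C00));         // infinity is not a NaN
    assert(is_signaling_bits(0x7D00));    // NaN with the quiet bit clear
    assert(!is_signaling_bits(0x7E00));   // NaN with the quiet bit set
    assert(quieted(0x7D00) == 0x7F00);    // payload kept, quiet bit added
    assert(quieted(0x3C00) == 0x3C00);    // ordinary values are untouched
}
```
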

CREDITS AND CONTACT
-------------------

This library is developed by CHRISTIAN RAU and released under the MIT License
(see LICENSE.txt). If you have any questions or problems with it, feel free to
contact me at rauy@users.sourceforge.net.

Additional credit goes to JEROEN VAN DER ZIJP for his paper on "Fast Half Float
Conversions", whose algorithms have been used in the library for converting
between half-precision and single-precision values.