HALF-PRECISION FLOATING POINT LIBRARY (Version 1.12.0)
------------------------------------------------------

This is a C++ header-only library to provide an IEEE 754 conformant 16-bit
half-precision floating point type along with corresponding arithmetic
operators, type conversions and common mathematical functions. It aims for both
efficiency and ease of use, trying to accurately mimic the behaviour of the
builtin floating point types at the best performance possible.


INSTALLATION AND REQUIREMENTS
-----------------------------

Conveniently, the library consists of just a single header file containing all
the functionality, which can be directly included by your projects, without the
necessity to build anything or link to anything.

Whereas this library is fully C++98-compatible, it can profit from certain
C++11 features. Support for those features is checked automatically at compile
(or rather preprocessing) time, but can be explicitly enabled or disabled by
defining the corresponding preprocessor symbols to either 1 or 0 yourself. This
is useful when the automatic detection fails (for more exotic implementations)
or when a feature should be explicitly disabled:

  - 'long long' integer type for mathematical functions returning 'long long'
    results (enabled for VC++ 2003 and newer, gcc and clang, overridable with
    'HALF_ENABLE_CPP11_LONG_LONG').

  - Static assertions for extended compile-time checks (enabled for VC++ 2010,
    gcc 4.3, clang 2.9 and newer, overridable with
    'HALF_ENABLE_CPP11_STATIC_ASSERT').

  - Generalized constant expressions (enabled for VC++ 2015, gcc 4.6, clang 3.1
    and newer, overridable with 'HALF_ENABLE_CPP11_CONSTEXPR').

  - noexcept exception specifications (enabled for VC++ 2015, gcc 4.6, clang 3.0
    and newer, overridable with 'HALF_ENABLE_CPP11_NOEXCEPT').

  - User-defined literals for half-precision literals to work (enabled for
    VC++ 2015, gcc 4.7, clang 3.1 and newer, overridable with
    'HALF_ENABLE_CPP11_USER_LITERALS').

  - Type traits and template meta-programming features from <type_traits>
    (enabled for VC++ 2010, libstdc++ 4.3, libc++ and newer, overridable with
    'HALF_ENABLE_CPP11_TYPE_TRAITS').

  - Special integer types from <cstdint> (enabled for VC++ 2010, libstdc++ 4.3,
    libc++ and newer, overridable with 'HALF_ENABLE_CPP11_CSTDINT').

  - Certain C++11 single-precision mathematical functions from <cmath> for
    an improved implementation of their half-precision counterparts to work
    (enabled for VC++ 2013, libstdc++ 4.3, libc++ and newer, overridable with
    'HALF_ENABLE_CPP11_CMATH').

  - Hash functor 'std::hash' from <functional> (enabled for VC++ 2010,
    libstdc++ 4.3, libc++ and newer, overridable with 'HALF_ENABLE_CPP11_HASH').
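
As a configuration sketch for the symbols listed above, the fragment below
overrides the automatic detection for three features. Which features to switch
is an assumption made up for illustration (an unusual toolchain where detection
misfires); only the symbol names come from the list above. Since the checks run
at preprocessing time, the definitions have to precede the include.

```cpp
// Assumed scenario: automatic feature detection gives the wrong answer.
// Define the overrides before the header is seen by the preprocessor.
#define HALF_ENABLE_CPP11_USER_LITERALS 0   // do not provide the _h literal
#define HALF_ENABLE_CPP11_CMATH 0           // do not rely on C++11 <cmath> functions
#define HALF_ENABLE_CPP11_LONG_LONG 1       // force 'long long' support on
#include "half.hpp"
```

The same definitions can usually be passed on the compiler command line
instead (for example '-DHALF_ENABLE_CPP11_CMATH=0' with gcc or clang), which
keeps them consistent across all translation units.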

The library has been tested successfully with Visual C++ 2005-2015, gcc 4.4-4.8
and clang 3.1. Please contact me if you have any problems or suggestions, or
simply to report success in testing it on other platforms.


DOCUMENTATION
-------------

Here follow some general words about the usage of the library and its
implementation. For complete documentation of its interface look at the
corresponding website http://half.sourceforge.net. You may also generate the
complete developer documentation from the Doxygen comments in the library's
only include file, but this is more relevant to developers than to mere users
(for reasons described below).

BASIC USAGE

To make use of the library just include its only header file half.hpp, which
defines all half-precision functionality inside the 'half_float' namespace. The
actual 16-bit half-precision data type is represented by the 'half' type. This
type behaves like the builtin floating point types as much as possible,
supporting the usual arithmetic, comparison and streaming operators, which
makes its use pretty straightforward:

    using half_float::half;
    half a(3.4), b(5);
    half c = a * b;
    c += 3;
    if(c > a)
        std::cout << c << std::endl;

Additionally the 'half_float' namespace also defines half-precision versions
for all mathematical functions of the C++ standard library, which can be used
directly through argument-dependent lookup (ADL):

    half a(-3.14159);
    half s = sin(abs(a));
    long l = lround(s);

You may also specify explicit half-precision literals, since the library
provides a user-defined literal inside the 'half_float::literal' namespace,
which you just need to import (assuming support for C++11 user-defined
literals):

    using namespace half_float::literal;
    half x = 1.0_h;

Furthermore the library provides proper specializations for
'std::numeric_limits', defining various implementation properties, and
'std::hash' for hashing half-precision numbers (assuming support for C++11
'std::hash'). Similar to the corresponding preprocessor symbols from <cmath>
the library also defines the 'HUGE_VALH' constant and possibly the
'FP_FAST_FMAH' symbol.

CONVERSIONS AND ROUNDING

The half is explicitly constructible/convertible from a single-precision float
argument. Thus it is also explicitly constructible/convertible from any type
implicitly convertible to float, but constructing it from types like double or
int will involve the usual warnings arising when implicitly converting those to
float because of the lost precision. On the one hand those warnings are
intentional, because converting those types to half necessarily also reduces
precision. But on the other hand they are raised for explicit conversions from
those types, when the user knows what they are doing. So if those warnings keep
bugging you, then you won't get around first explicitly converting to float
before converting to half, or using the 'half_cast' described below. In
addition you can also directly assign float values to halfs.

In contrast to the float-to-half conversion, which reduces precision, the
conversion from half to float (and thus to any other type implicitly
convertible from float) is implicit, because all values representable with
half-precision are also representable with single-precision. This way the
half-to-float conversion behaves similarly to the builtin float-to-double
conversion and all arithmetic expressions involving both half-precision and
single-precision arguments will be of single-precision type. This way you can
also directly use the mathematical functions of the C++ standard library,
though in this case you will invoke the single-precision versions which will
also return single-precision values, which is (even if maybe performing the
exact same computation, see below) not as conceptually clean when working in a
half-precision environment.
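
To illustrate why the half-to-float direction cannot lose information, the
following self-contained sketch decodes a raw 16-bit pattern (1 sign bit,
5 exponent bits, 10 mantissa bits, see IEEE CONFORMANCE) into a float using
only exact operations. The name 'half_bits_to_float' is hypothetical and not
part of half.hpp, and this is not how the library itself performs the
conversion; it is merely a readable reference for the value each bit pattern
stands for.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of half.hpp): returns the float that has the
// same value as the given IEEE 754 half-precision bit pattern.
float half_bits_to_float(std::uint16_t bits)
{
    const int sign     = (bits >> 15) & 0x1;   // 1 sign bit
    const int exponent = (bits >> 10) & 0x1F;  // 5 exponent bits, bias 15
    const int mantissa = bits & 0x3FF;         // 10 stored mantissa bits

    float magnitude;
    if (exponent == 0x1F)        // all ones: infinity or NaN
        magnitude = mantissa ? std::numeric_limits<float>::quiet_NaN()
                             : std::numeric_limits<float>::infinity();
    else if (exponent == 0)      // zero or subnormal: mantissa * 2^-24
        magnitude = std::ldexp(static_cast<float>(mantissa), -24);
    else                         // normal: (1024 + mantissa) * 2^(exponent - 25)
        magnitude = std::ldexp(static_cast<float>(1024 + mantissa), exponent - 25);

    // At most 11 significant bits are needed, so nothing above was rounded.
    return sign ? -magnitude : magnitude;
}
```

For example, 'half_bits_to_float(0x3C00)' yields 1.0f and
'half_bits_to_float(0x7BFF)' yields 65504.0f, the largest finite half value.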

The default rounding mode for conversions from float to half uses truncation
(round toward zero, but mapping overflows to infinity) for rounding values not
representable exactly in half-precision. This is the fastest rounding possible
and is usually sufficient. But by redefining the 'HALF_ROUND_STYLE'
preprocessor symbol (before including half.hpp) this default can be overridden
with one of the other standard rounding modes using their respective constants
or the equivalent values of 'std::float_round_style' (it can even be
synchronized with the underlying single-precision implementation by defining it
to 'std::numeric_limits<float>::round_style'):

  - 'std::round_indeterminate' or -1 for the fastest rounding (default).

  - 'std::round_toward_zero' or 0 for rounding toward zero.

  - 'std::round_to_nearest' or 1 for rounding to the nearest value.

  - 'std::round_toward_infinity' or 2 for rounding toward positive infinity.

  - 'std::round_toward_neg_infinity' or 3 for rounding toward negative infinity.
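
As a configuration sketch, selecting one of the modes above for the whole
program could look as follows. The choice of round-to-nearest is only an
example; the symbol and its numeric values are the ones listed above.

```cpp
// Select round-to-nearest for all default float-to-half conversions.
// The definition has to appear before the header is included.
#define HALF_ROUND_STYLE 1   // 1 corresponds to std::round_to_nearest
#include "half.hpp"
```

It is probably safest to use the same definition in every translation unit
that includes half.hpp, so that all parts of a program round the same way.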

In addition to changing the overall default rounding mode one can also use the
'half_cast'. This converts between half and any built-in arithmetic type using
a configurable rounding mode (or the default rounding mode if none is
specified). In addition to a configurable rounding mode, 'half_cast' has
another big difference to a mere 'static_cast': Any conversions are performed
directly using the given rounding mode, without any intermediate conversion
to/from 'float'. This is especially relevant for conversions to integer types,
which don't necessarily truncate anymore. But also for conversions from
'double' or 'long double' this may produce more precise results than a
pre-conversion to 'float' using the single-precision implementation's current
rounding mode would.

    half a = half_cast<half>(4.2);
    half b = half_cast<half,std::numeric_limits<float>::round_style>(4.2f);
    assert( half_cast<int, std::round_to_nearest>( 0.7_h ) == 1 );
    assert( half_cast<half,std::round_toward_zero>( 4097 ) == 4096.0_h );
    assert( half_cast<half,std::round_toward_infinity>( 4097 ) == 4100.0_h );
    assert( half_cast<half,std::round_toward_infinity>( std::numeric_limits<double>::min() ) > 0.0_h );
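
The results for 4097 above follow from the spacing of half-precision values:
with 11 significant bits, neighbouring values between 4096 and 8192 are 4
apart. The sketch below reproduces that arithmetic without the library. The
function 'round_to_half_grid' is a hypothetical helper, not part of half.hpp
and not how 'half_cast' is implemented; it works on doubles and, as a stated
simplification, ignores subnormals and overflow.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper (not part of half.hpp): rounds a value to a multiple of
// the half-precision spacing at its magnitude, using the given rounding style.
// Simplification: subnormal and overflowing magnitudes are not handled.
double round_to_half_grid(double value, std::float_round_style style)
{
    if (value == 0.0 || !std::isfinite(value))
        return value;

    int exponent = 0;
    std::frexp(value, &exponent);   // |value| lies in [2^(exponent-1), 2^exponent)

    // 11 significant bits give a spacing of 2^(exponent-11) in this range.
    const double spacing = std::ldexp(1.0, exponent - 11);
    const double scaled  = value / spacing;   // exact, divisor is a power of two

    double rounded;
    switch (style) {
        case std::round_toward_infinity:     rounded = std::ceil(scaled);  break;
        case std::round_toward_neg_infinity: rounded = std::floor(scaled); break;
        case std::round_to_nearest:          rounded = std::round(scaled); break; // ties away from zero
        default:                             rounded = std::trunc(scaled); break; // toward zero
    }
    return rounded * spacing;
}
```

With this helper, 'round_to_half_grid(4097.0, std::round_toward_zero)' gives
4096.0 and 'round_to_half_grid(4097.0, std::round_toward_infinity)' gives
4100.0, matching the assertions above.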

When using round to nearest (either as default or through 'half_cast') ties are
by default resolved by rounding them away from zero (and thus equal to the
behaviour of the 'round' function). But by redefining the
'HALF_ROUND_TIES_TO_EVEN' preprocessor symbol to 1 (before including half.hpp)
this default can be changed to the slightly slower but less biased and more
IEEE-conformant behaviour of rounding half-way cases to the nearest even value.

    #define HALF_ROUND_TIES_TO_EVEN 1
    #include <half.hpp>
    ...
    assert( half_cast<int,std::round_to_nearest>(3.5_h)
         == half_cast<int,std::round_to_nearest>(4.5_h) );
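
The difference between the two tie-breaking rules can be seen on plain doubles,
independent of the library. In this sketch 'round_ties_away' and
'round_ties_even' are hypothetical names chosen for illustration; neither is
part of half.hpp.

```cpp
#include <cassert>
#include <cmath>

// Ties away from zero, the behaviour of the standard 'round' family.
long round_ties_away(double x)
{
    return std::lround(x);
}

// Ties to even: half-way cases go to the neighbour that is an even integer.
long round_ties_even(double x)
{
    const double lower = std::floor(x);
    const double diff  = x - lower;          // in [0, 1)
    if (diff < 0.5) return static_cast<long>(lower);
    if (diff > 0.5) return static_cast<long>(lower) + 1;
    // Exact tie: keep 'lower' if it is even, otherwise step up to the even one.
    return std::fmod(lower, 2.0) == 0.0 ? static_cast<long>(lower)
                                        : static_cast<long>(lower) + 1;
}
```

Here 3.5 and 4.5 both map to 4 under 'round_ties_even', while
'round_ties_away' maps them to 4 and 5. Always rounding ties upward in
magnitude introduces a small systematic bias, which is what ties-to-even
avoids.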

IMPLEMENTATION

For performance reasons (and ease of implementation) many of the mathematical
functions provided by the library as well as all arithmetic operations are
actually carried out in single-precision under the hood, calling to the C++
standard library implementations of those functions whenever appropriate,
meaning the arguments are converted to floats and the result back to half. But
to reduce the conversion overhead as much as possible any temporary values
inside of lengthy expressions are kept in single-precision as long as possible,
while still maintaining a strong half-precision type to the outside world. Only
when finally assigning the value to a half or calling a function that works
directly on halfs is the actual conversion done (or never, when further
converting the result to float).

This approach has two implications. First of all you have to treat the
library's documentation at http://half.sourceforge.net as a simplified version,
describing the behaviour of the library as if all functions and operators
worked directly on halfs. The actual argument and return types of functions and
operators may involve other internal types (feel free to generate the exact
developer documentation from the Doxygen comments in the library's header file
if you really need to). But nevertheless the behaviour is exactly as specified
in the documentation. The other implication is that in the presence of rounding
errors or over-/underflows arithmetic expressions may produce different results
when compared to converting to half-precision after each individual operation:

    half a = std::numeric_limits<half>::max() * 2.0_h / 2.0_h;       // a = MAX
    half b = half(std::numeric_limits<half>::max() * 2.0_h) / 2.0_h; // b = INF
    assert( a != b );
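
The effect shown above can be reproduced with plain floats. In the sketch
below, 'clamp_to_half_range' is a hypothetical stand-in for the float-to-half
conversion that models only one aspect of it, namely mapping magnitudes beyond
the largest finite half value (65504) to infinity. All names are assumptions
made for illustration and none of them exist in half.hpp.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

const float kHalfMax = 65504.0f;   // largest finite half-precision value

// Hypothetical stand-in for a float-to-half conversion. It only models
// overflow to infinity and leaves every other value untouched.
float clamp_to_half_range(float value)
{
    if (std::fabs(value) > kHalfMax)
        return std::copysign(std::numeric_limits<float>::infinity(), value);
    return value;
}

// Intermediate result stays in single precision, one conversion at the end.
float deferred_conversion()
{
    return clamp_to_half_range(kHalfMax * 2.0f / 2.0f);
}

// Conversion after every operation, as if each step produced a real half.
float eager_conversion()
{
    const float doubled = clamp_to_half_range(kHalfMax * 2.0f);   // overflows
    return clamp_to_half_range(doubled / 2.0f);                   // stays infinite
}
```

'deferred_conversion()' returns 65504 because the doubled value still fits
into a float, whereas 'eager_conversion()' returns infinity because the
overflow happens before the division can undo it.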

But this should only be a problem in very few cases. One last word has to be
said when talking about performance. Even with its efforts in reducing
conversion overhead as much as possible, the software half-precision
implementation can most probably not beat the direct use of single-precision
computations. Usually using actual float values for all computations and
temporaries and using halfs only for storage is the recommended way. On the
one hand this somewhat makes the provided mathematical functions obsolete
(especially in light of the implicit conversion from half to float), but
nevertheless the goal of this library was to provide a complete and
conceptually clean half-precision implementation, to which the standard
mathematical functions belong, even if usually not needed.

IEEE CONFORMANCE

The half type uses the standard IEEE representation with 1 sign bit, 5 exponent
bits and 10 mantissa bits (11 when counting the hidden bit). It supports all
types of special values, like subnormal values, infinity and NaNs. But there
are some limitations to the complete conformance to the IEEE 754 standard:

  - The implementation does not differentiate between signalling and quiet
    NaNs, which means operations on halfs are not specified to trap on
    signalling NaNs (though they may, see last point).

  - Though arithmetic operations are internally rounded to single-precision
    using the underlying single-precision implementation's current rounding
    mode, those values are then converted to half-precision using the default
    half-precision rounding mode (changed by defining 'HALF_ROUND_STYLE'
    accordingly). This mixture of rounding modes is also the reason why
    'std::numeric_limits<half>::round_style' may actually return
    'std::round_indeterminate' when half- and single-precision rounding modes
    don't match.

  - Because of internal truncation it may also be that certain single-precision
    NaNs will be wrongly converted to half-precision infinity, though this is
    very unlikely to happen, since most single-precision implementations don't
    tend to only set the lowest bits of a NaN mantissa.

  - The implementation does not provide any floating point exceptions, thus
    arithmetic operations or mathematical functions are not specified to invoke
    proper floating point exceptions. But due to many functions being
    implemented in single-precision, those may still invoke floating point
    exceptions of the underlying single-precision implementation.
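
The NaN-to-infinity case from the list above can be made concrete on raw bit
patterns. A single-precision NaN has all exponent bits set and a non-zero
23-bit mantissa; narrowing it to the 10 mantissa bits of a half by dropping the
lowest 13 bits leaves a zero mantissa if only those low bits were set, and a
zero mantissa with all exponent bits set is infinity. The helper names below
are hypothetical, and the sketch covers only infinities and NaNs; it is not
the library's conversion code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: narrows the bit pattern of a single-precision infinity
// or NaN to 16 bits by truncating the lowest 13 mantissa bits.
std::uint16_t truncate_special_to_half_bits(std::uint32_t float_bits)
{
    assert(((float_bits >> 23) & 0xFFu) == 0xFFu);   // infinities and NaNs only
    const std::uint32_t sign     = (float_bits >> 16) & 0x8000u;
    const std::uint32_t mantissa = (float_bits & 0x007FFFFFu) >> 13;
    return static_cast<std::uint16_t>(sign | 0x7C00u | mantissa);
}

// NaN: exponent bits all set and mantissa non-zero.
bool is_half_nan(std::uint16_t bits)
{
    return (bits & 0x7C00u) == 0x7C00u && (bits & 0x03FFu) != 0;
}

// Infinity: exponent bits all set and mantissa zero, either sign.
bool is_half_infinity(std::uint16_t bits)
{
    return (bits & 0x7FFFu) == 0x7C00u;
}
```

The common quiet NaN pattern 0x7FC00000 has its highest mantissa bit set and
survives as the half NaN 0x7E00, while a pattern such as 0x7F800001, with only
the lowest mantissa bit set, ends up as 0x7C00, which is positive infinity.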

Some of those points could have been circumvented by controlling the floating
point environment using <cfenv> or implementing a similar exception mechanism.
But this would have required excessive runtime checks with too high an impact
on performance for something that is rarely ever needed. If you really need to
rely on proper floating point exceptions, it is recommended to explicitly
perform computations using the built-in floating point types to be on the safe
side. In the same way, if you really need to rely on a particular rounding
behaviour, it is recommended to either use single-precision computations and
explicitly convert the result to half-precision using 'half_cast' and
specifying the desired rounding mode, or synchronize the default half-precision
rounding mode to the rounding mode of the single-precision implementation (most
likely 'HALF_ROUND_STYLE=1', 'HALF_ROUND_TIES_TO_EVEN=1'). But this is really
considered an expert scenario that should be used only when necessary, since
actually working with half-precision usually comes with a certain
tolerance/ignorance of exactness considerations and proper rounding comes with
a certain performance cost.


CREDITS AND CONTACT
-------------------

This library is developed by CHRISTIAN RAU and released under the MIT License
(see LICENSE.txt). If you have any questions or problems with it, feel free to
contact me at rauy@users.sourceforge.net.

Additional credit goes to JEROEN VAN DER ZIJP for his paper on "Fast Half Float
Conversions", whose algorithms have been used in the library for converting
between half-precision and single-precision values.