HALF-PRECISION FLOATING-POINT LIBRARY (Version 2.2.0)
-----------------------------------------------------

This is a C++ header-only library to provide an IEEE 754 conformant 16-bit
half-precision floating-point type along with corresponding arithmetic
operators, type conversions and common mathematical functions. It aims for both
efficiency and ease of use, trying to accurately mimic the behaviour of the
built-in floating-point types at the best performance possible.


INSTALLATION AND REQUIREMENTS
-----------------------------

Conveniently, the library consists of just a single header file containing all
the functionality, which can be directly included by your projects, without the
necessity to build anything or link to anything.

While this library is fully C++98-compatible, it can profit from certain
C++11 features. Support for those features is checked automatically at compile
(or rather preprocessing) time, but can be explicitly enabled or disabled by
predefining the corresponding preprocessor symbols to either 1 or 0 yourself
before including half.hpp. This is useful when the automatic detection fails
(for more exotic implementations) or when a feature should be explicitly
disabled:

- 'long long' integer type for mathematical functions returning 'long long'
  results (enabled for VC++ 2003 and icc 11.1 and newer, gcc and clang,
  overridable with 'HALF_ENABLE_CPP11_LONG_LONG').

- Static assertions for extended compile-time checks (enabled for VC++ 2010,
  gcc 4.3, clang 2.9, icc 11.1 and newer, overridable with
  'HALF_ENABLE_CPP11_STATIC_ASSERT').

- Generalized constant expressions (enabled for VC++ 2015, gcc 4.6, clang 3.1,
  icc 14.0 and newer, overridable with 'HALF_ENABLE_CPP11_CONSTEXPR').

- noexcept exception specifications (enabled for VC++ 2015, gcc 4.6,
  clang 3.0, icc 14.0 and newer, overridable with 'HALF_ENABLE_CPP11_NOEXCEPT').

- User-defined literals for half-precision literals to work (enabled for
  VC++ 2015, gcc 4.7, clang 3.1, icc 15.0 and newer, overridable with
  'HALF_ENABLE_CPP11_USER_LITERALS').

- Thread-local storage for per-thread floating-point exception flags (enabled
  for VC++ 2015, gcc 4.8, clang 3.3, icc 15.0 and newer, overridable with
  'HALF_ENABLE_CPP11_THREAD_LOCAL').

- Type traits and template meta-programming features from <type_traits>
  (enabled for VC++ 2010, libstdc++ 4.3, libc++ and newer, overridable with
  'HALF_ENABLE_CPP11_TYPE_TRAITS').

- Special integer types from <cstdint> (enabled for VC++ 2010, libstdc++ 4.3,
  libc++ and newer, overridable with 'HALF_ENABLE_CPP11_CSTDINT').

- Certain C++11 single-precision mathematical functions from <cmath> for
  floating-point classification during conversions from higher precision types
  (enabled for VC++ 2013, libstdc++ 4.3, libc++ and newer, overridable with
  'HALF_ENABLE_CPP11_CMATH').

- Floating-point environment control from <cfenv> for possible exception
  propagation to the built-in floating-point platform (enabled for VC++ 2013,
  libstdc++ 4.3, libc++ and newer, overridable with 'HALF_ENABLE_CPP11_CFENV').

- Hash functor 'std::hash' from <functional> (enabled for VC++ 2010,
  libstdc++ 4.3, libc++ and newer, overridable with 'HALF_ENABLE_CPP11_HASH').
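
As a rough sketch of the override mechanism described above, the symbols are
defined before the header is first included. The two symbols chosen here are
arbitrary picks from the list, and the '__has_include' guard is an assumption
added only so that the fragment still compiles where half.hpp is not on the
include path; a real project would include the header unconditionally.

```cpp
#include <cassert>

// Assumption: these defines come before the first inclusion of half.hpp in
// the translation unit, so that they take precedence over auto-detection.
#define HALF_ENABLE_CPP11_CONSTEXPR 0     // explicitly disable constexpr use
#define HALF_ENABLE_CPP11_THREAD_LOCAL 0  // explicitly disable thread-local flags

// Guard for illustration only: skips the include if the header is missing.
#if defined(__has_include)
#  if __has_include("half.hpp")
#    include "half.hpp"
#  endif
#endif
```

The same defines can usually be passed on the compiler command line instead
(for example with a '-D' option), which keeps them consistent across all
translation units.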

The library has been tested successfully with Visual C++ 2005-2015, gcc 4-8
and clang 3-8 on 32- and 64-bit x86 systems. Please contact me if you have any
problems, suggestions or even just success testing it on other platforms.


DOCUMENTATION
-------------

What follows are some general words about the usage of the library and its
implementation. For complete documentation of its interface, consult the
corresponding website http://half.sourceforge.net. You may also generate the
complete developer documentation from the doxygen comments in the library's
only include file, but this is more relevant to developers than to mere users.

BASIC USAGE

To make use of the library just include its only header file half.hpp, which
defines all half-precision functionality inside the 'half_float' namespace. The
actual 16-bit half-precision data type is represented by the 'half' type, which
uses the standard IEEE representation with 1 sign bit, 5 exponent bits and 11
mantissa bits (including the hidden bit) and supports all types of special
values, like subnormal values, infinity and NaNs. This type behaves like the
built-in floating-point types as much as possible, supporting the usual
arithmetic, comparison and streaming operators, which makes its use pretty
straightforward:

    using half_float::half;
    half a(3.4), b(5);
    half c = a * b;
    c += 3;
    if(c > a)
        std::cout << c << std::endl;
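
To make the bit layout mentioned above concrete (1 sign bit, 5 exponent bits,
10 stored mantissa bits plus the hidden bit), here is a minimal standalone
sketch that decodes a raw 16-bit pattern into a float. The function name
'decode_half_bits' is made up for this illustration; it is not part of the
library and not the conversion algorithm the library uses.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: interpret 16 raw bits as an IEEE 754 half-precision
// value and return it as a float (every half value fits into a float exactly).
float decode_half_bits(std::uint16_t bits) {
    const int sign     = (bits >> 15) & 0x1;   // 1 sign bit
    const int exponent = (bits >> 10) & 0x1F;  // 5 exponent bits, bias 15
    const int fraction = bits & 0x3FF;         // 10 stored mantissa bits

    float magnitude;
    if (exponent == 0) {
        // Zero or subnormal: no hidden bit, value = fraction * 2^-24.
        magnitude = std::ldexp(static_cast<float>(fraction), -24);
    } else if (exponent == 31) {
        // All exponent bits set: infinity (fraction == 0) or NaN.
        magnitude = fraction ? std::numeric_limits<float>::quiet_NaN()
                             : std::numeric_limits<float>::infinity();
    } else {
        // Normal: hidden bit 0x400 restored, value = (1024 + f) * 2^(e - 25).
        magnitude = std::ldexp(static_cast<float>(fraction | 0x400), exponent - 25);
    }
    return sign ? -magnitude : magnitude;
}
```

For example, the pattern 0x3C00 decodes to 1.0 and 0x7BFF to 65504, the
largest finite half-precision value.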

Additionally, the 'half_float' namespace defines half-precision versions of
all mathematical functions of the C++ standard library, which can be used
directly through ADL (argument-dependent lookup):

    half a(-3.14159);
    half s = sin(abs(a));
    long l = lround(s);
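
The lookup rule behind those unqualified calls can be shown without the
library at all. The sketch below uses a made-up namespace 'toy' and type
'toy::number' (both hypothetical) to show that functions declared next to a
type are found when called with an argument of that type, with no namespace
qualification and no using-directive.

```cpp
#include <cassert>
#include <cmath>

namespace toy {
    // Hypothetical stand-in for a user-defined numeric type.
    struct number { float value; };

    // Functions in the same namespace as the type take part in ADL.
    inline number abs(number x)  { return number{std::fabs(x.value)}; }
    inline number sqrt(number x) { return number{std::sqrt(x.value)}; }
}

// The calls below are unqualified; the compiler finds toy::abs and toy::sqrt
// because the argument type lives in namespace toy.
float adl_demo(float input) {
    toy::number a{input};
    toy::number r = sqrt(abs(a));
    return r.value;
}
```

Calling 'adl_demo(-16.0f)' yields 4, computed entirely by the overloads in
'toy'.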

You may also specify explicit half-precision literals, since the library
provides a user-defined literal inside the 'half_float::literal' namespace,
which you just need to import (assuming support for C++11 user-defined literals):

    using namespace half_float::literal;
    half x = 1.0_h;

Furthermore the library provides proper specializations for
'std::numeric_limits', defining various implementation properties, and
'std::hash' for hashing half-precision numbers (assuming support for C++11
'std::hash'). Similar to the corresponding preprocessor symbols from <cmath>
the library also defines the 'HUGE_VALH' constant and possibly the
'FP_FAST_FMAH' symbol.

CONVERSIONS AND ROUNDING

The half type is explicitly constructible/convertible from a single-precision
float argument. Thus it is also explicitly constructible/convertible from any
type implicitly convertible to float, but constructing it from types like
double or int will involve the usual warnings arising when implicitly
converting those to float because of the lost precision. On the one hand those
warnings are intentional, because converting those types to half necessarily
also reduces precision. But on the other hand they are raised for explicit
conversions from those types, when the user knows what they are doing. So if
those warnings keep bugging you, you will have to either convert to float
explicitly before converting to half, or use the 'half_cast' described below.
In addition you can also directly assign float values to halfs.

In contrast to the float-to-half conversion, which reduces precision, the
conversion from half to float (and thus to any other type implicitly
convertible from float) is implicit, because all values representable with
half-precision are also representable with single-precision. This way the
half-to-float conversion behaves similarly to the built-in float-to-double
conversion and all arithmetic expressions involving both half-precision and
single-precision arguments will be of single-precision type. This way you can
also directly use the mathematical functions of the C++ standard library,
though in this case you will invoke the single-precision versions which will
also return single-precision values, which is (even if maybe performing the
exact same computation, see below) not as conceptually clean when working in a
half-precision environment.

The default rounding mode for conversions between half and more precise types
as well as for rounding results of arithmetic operations and mathematical
functions rounds to the nearest representable value. But by predefining the
'HALF_ROUND_STYLE' preprocessor symbol this default can be overridden with one
of the other standard rounding modes using their respective constants or the
equivalent values of 'std::float_round_style' (it can even be synchronized with
the built-in single-precision implementation by defining it to
'std::numeric_limits<float>::round_style'):

- 'std::round_indeterminate' (-1) for the fastest rounding.

- 'std::round_toward_zero' (0) for rounding toward zero.

- 'std::round_to_nearest' (1) for rounding to the nearest value (default).

- 'std::round_toward_infinity' (2) for rounding toward positive infinity.

- 'std::round_toward_neg_infinity' (3) for rounding toward negative infinity.

In addition to changing the overall default rounding mode one can also use the
'half_cast'. This converts between half and any built-in arithmetic type using
a configurable rounding mode (or the default rounding mode if none is
specified). In addition to a configurable rounding mode, 'half_cast' has
another big difference from a mere 'static_cast': all conversions are performed
directly using the given rounding mode, without any intermediate conversion
to/from 'float'. This is especially relevant for conversions to integer types,
which don't necessarily truncate anymore. But also for conversions from
'double' or 'long double' this may produce more precise results than a
pre-conversion to 'float' using the single-precision implementation's current
rounding mode would.

    half a = half_cast<half>(4.2);
    half b = half_cast<half,std::numeric_limits<float>::round_style>(4.2f);
    assert( half_cast<int, std::round_to_nearest>( 0.7_h ) == 1 );
    assert( half_cast<half,std::round_toward_zero>( 4097 ) == 4096.0_h );
    assert( half_cast<half,std::round_toward_infinity>( 4097 ) == 4100.0_h );
    assert( half_cast<half,std::round_toward_infinity>( std::numeric_limits<double>::min() ) > 0.0_h );
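
The two assertions on 4097 follow from plain arithmetic: with 11 significant
bits, values between 4096 and 8192 are spaced 4 apart, so 4097 falls between
4096 and 4100. The standalone sketch below reproduces that reasoning for
positive integers using the standard 'std::float_round_style' constants. The
function 'round_to_half_grid' is a made-up name, it is not how the library
implements 'half_cast', and it assumes IEEE ties-to-even for
'std::round_to_nearest' and plain truncation for 'std::round_indeterminate'.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: round a positive integer to the closest value that has
// at most 11 significant bits, the precision of a half. Illustration only:
// overflow beyond 65504, negative inputs and subnormals are ignored.
long round_to_half_grid(long value, std::float_round_style style) {
    // Find the grid spacing: the smallest power of two for which the
    // significand value / step fits below 2^11 = 2048.
    long step = 1;
    while (value / step >= 2048)
        step *= 2;

    const long down = value / step * step;                  // neighbour below
    const long up = (down == value) ? value : down + step;  // neighbour above

    switch (style) {
    case std::round_toward_infinity:
        return up;
    case std::round_to_nearest: {
        const long remainder = value - down;
        if (2 * remainder != step)
            return (2 * remainder > step) ? up : down;
        // Exact tie: choose the neighbour with an even significand.
        return ((down / step) % 2 == 0) ? down : up;
    }
    default:
        // Toward zero and toward negative infinity agree for positive input;
        // round_indeterminate is treated as truncation (assumption).
        return down;
    }
}
```

With this helper, 4097 maps to 4096 when rounding toward zero or to nearest,
and to 4100 when rounding toward positive infinity.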

ACCURACY AND PERFORMANCE

From version 2.0 onward the library is implemented without employing the
underlying floating-point implementation of the system (except for conversions,
of course), providing an entirely self-contained half-precision implementation
with results independent from the system's existing single- or double-precision
implementation and its rounding behaviour.

As to accuracy, many of the operators and functions provided by this library
are exact to rounding for all rounding modes, i.e. the error to the exact
result is at most 0.5 ULP (unit in the last place) for rounding to nearest and
less than 1 ULP for all other rounding modes. This holds for all the operations
required by the IEEE 754 standard and many more. The exceptions are the
following functions, which might deviate from the correctly rounded exact
result by 1 ULP for a select few input values: 'expm1', 'log1p', 'pow',
'atan2', 'erf', 'erfc', 'lgamma', 'tgamma' (for more details see the
documentation of the individual functions). All other functions and operators
are always exact to rounding or independent of the rounding mode altogether.
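
To give the ULP figures above a concrete scale, the following standalone
sketch computes the size of one half-precision ULP at a given magnitude. The
function 'half_ulp' is a made-up helper for this illustration and is not
provided by the library.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: spacing between adjacent half-precision values around
// x, for finite x within the half-precision range.
double half_ulp(double x) {
    const double smallest = std::ldexp(1.0, -24);  // spacing of subnormals
    if (x == 0.0)
        return smallest;
    int exp = 0;
    std::frexp(std::fabs(x), &exp);  // |x| = m * 2^exp with m in [0.5, 1)
    int e = exp - 1;                 // floor(log2(|x|))
    if (e < -14)
        e = -14;                     // subnormals share the smallest spacing
    return std::ldexp(1.0, e - 10);  // 10 stored mantissa bits
}
```

For instance, one ULP is 2^-10 near 1.0, 4 near 4097 and 32 near 65504, so
"at most 0.5 ULP" means an absolute error of at most 2 for a result near 4097.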

The increased IEEE-conformance and cleanliness of this implementation comes
with a certain performance cost compared to doing computations and mathematical
functions in hardware-accelerated single-precision. On average and depending on
the platform, the arithmetic operators are about 75% as fast and the
mathematical functions about 33-50% as fast as performing the corresponding
operations in single-precision and converting between the inputs and outputs.
However, directly computing with half-precision values is a rather rare
use-case and usually using actual 'float' values for all computations and
temporaries and using 'half's only for storage is the recommended way. But
nevertheless the goal of this library was to provide a complete and
conceptually clean IEEE-conformant half-precision implementation and in the few
cases when you do need to compute directly in half-precision you do so for a
reason and want accurate results.

If necessary, this internal implementation can be overridden by predefining the
'HALF_ARITHMETIC_TYPE' preprocessor symbol to one of the built-in
floating-point types ('float', 'double' or 'long double'), which will cause the
library to use this type for computing arithmetic operations and mathematical
functions (if available). However, due to using the platform's floating-point
implementation (and its rounding behaviour) internally, this might cause
results to deviate from the specified half-precision rounding mode. It will of
course also inhibit the automatic exception detection described below.

The conversion operations between half-precision and single-precision types can
also make use of the F16C extension for x86 processors by using the
corresponding compiler intrinsics from <immintrin.h>. Support for this is
checked at compile-time by looking for the '__F16C__' macro which at least gcc
and clang define based on the target platform. It can also be enabled manually
by predefining the 'HALF_ENABLE_F16C_INTRINSICS' preprocessor symbol to 1, or 0
for explicitly disabling it. However, this will directly use the corresponding
intrinsics for conversion without checking if they are available at runtime
(possibly crashing if they are not), so make sure they are supported on the
target platform before enabling this.

EXCEPTION HANDLING

The half-precision implementation supports all 5 required floating-point
exceptions from the IEEE standard to indicate erroneous inputs or inexact
results during operations. These are represented by exception flags which
actually use the same values as the corresponding 'FE_...' flags defined in
C++11's <cfenv> header if supported, specifically:

- 'FE_INVALID' for invalid inputs to an operation.
- 'FE_DIVBYZERO' for finite inputs producing infinite results.
- 'FE_OVERFLOW' if a result is too large to represent finitely.
- 'FE_UNDERFLOW' for a subnormal or zero result after rounding.
- 'FE_INEXACT' if a result needed rounding to be representable.
- 'FE_ALL_EXCEPT' as a convenient OR of all possible exception flags.

The internal exception flag state will start with all flags cleared and is
maintained per thread if C++11 thread-local storage is supported, otherwise it
will be maintained globally and will theoretically NOT be thread-safe (while
practically being as thread-safe as a simple integer variable can be). These
flags can be managed explicitly using the library's error handling functions,
which again try to mimic the built-in functions for handling floating-point
exceptions from <cfenv>. You can clear them with 'feclearexcept' (which is the
only way a flag can be cleared), test them with 'fetestexcept', explicitly
raise errors with 'feraiseexcept' and save and restore their state using
'fegetexceptflag' and 'fesetexceptflag'. You can also throw corresponding C++
exceptions based on the current flag state using 'fethrowexcept'.
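
Since the library's flag functions are modelled on <cfenv>, the usage pattern
can be sketched with the standard functions themselves. The code below uses
only the standard 'std::' versions, which act on the built-in floating-point
environment rather than on the library's own flag state; the helper names
'raise_and_test' and 'save_clear_restore' are made up for this illustration.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical helper: raise one flag explicitly and report whether a
// subsequent test sees it. Flags are sticky until cleared.
bool raise_and_test(int flag) {
    std::feclearexcept(FE_ALL_EXCEPT);    // start from a clean state
    std::feraiseexcept(flag);             // set the flag explicitly
    return std::fetestexcept(flag) != 0;  // non-zero if the flag is set
}

// Hypothetical helper: save the flag state, clear it, then restore it.
// Returns true if the flag was gone after clearing and back after restoring.
bool save_clear_restore(int flag) {
    std::feclearexcept(FE_ALL_EXCEPT);
    std::feraiseexcept(flag);

    std::fexcept_t saved;
    std::fegetexceptflag(&saved, FE_ALL_EXCEPT);  // snapshot all flags

    std::feclearexcept(FE_ALL_EXCEPT);
    const bool cleared = std::fetestexcept(flag) == 0;

    std::fesetexceptflag(&saved, FE_ALL_EXCEPT);  // restore the snapshot
    return cleared && std::fetestexcept(flag) != 0;
}
```

The same clear, raise, test, save and restore sequence is what the library's
identically named functions are described as mimicking for half-precision
operations.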

However, any automatic exception detection and handling during half-precision
operations and functions is DISABLED by default, since it comes with a minor
performance overhead due to runtime checks, and reacting to IEEE floating-point
exceptions is rarely ever needed in application code. But the library fully
supports IEEE-conformant detection of floating-point exceptions and various
ways for handling them, which can be enabled by pre-defining the corresponding
preprocessor symbols to 1. They can be enabled individually or all at once and
they will be processed in the order they are listed here:

- 'HALF_ERRHANDLING_FLAGS' sets the internal exception flags described above
  whenever the corresponding exception occurs.
- 'HALF_ERRHANDLING_ERRNO' sets the value of 'errno' from <cerrno> similar to
  the behaviour of the built-in floating-point types when 'MATH_ERRNO' is used.
- 'HALF_ERRHANDLING_FENV' will propagate exceptions to the built-in
  floating-point implementation using 'std::feraiseexcept' if support for
  C++11 floating-point control is enabled. However, this does not synchronize
  exceptions: neither will clearing propagate nor will it work in reverse.
- 'HALF_ERRHANDLING_THROW_...' can be defined to a string literal which will
  be used as description message for a C++ exception that is thrown whenever
  an 'FE_...' exception occurs, similar to the behaviour of 'fethrowexcept'.

If any of the above error handling is activated, non-quiet operations on
half-precision values will also raise an 'FE_INVALID' exception whenever
they encounter a signaling NaN value, in addition to transforming the value
into a quiet NaN. If error handling is disabled, signaling NaNs will be
treated like quiet NaNs (while still getting explicitly quieted if propagated
to the result). There can also be additional treatment of overflow and
underflow errors after they have been processed as above, which is ENABLED by
default (but of course only takes effect if any other exception handling is
activated) unless overridden by pre-defining the corresponding preprocessor
symbol to 0:

- 'HALF_ERRHANDLING_OVERFLOW_TO_INEXACT' will cause overflow errors to also
  raise an 'FE_INEXACT' exception.
- 'HALF_ERRHANDLING_UNDERFLOW_TO_INEXACT' will cause underflow errors to also
  raise an 'FE_INEXACT' exception. This will also slightly change the
  behaviour of the underflow exception, which will ONLY be raised if the
  result is actually inexact due to underflow. If this is disabled, underflow
  exceptions will be raised for ANY (possibly exact) subnormal result.


CREDITS AND CONTACT
-------------------

This library is developed by CHRISTIAN RAU and released under the MIT License
(see LICENSE.txt). If you have any questions or problems with it, feel free to
contact me at rauy@users.sourceforge.net.

Additional credit goes to JEROEN VAN DER ZIJP for his paper on "Fast Half Float
Conversions", whose algorithms have been used in the library for converting
between half-precision and single-precision values.