Fraction
Floating point storage notation.
Sign bit - separate flag for sign.
Signed biased exponent - a fixed bias is added so the stored value is never negative (a true exponent of 0 is stored as the bias, not as 0).
Significand, mantissa, coefficient - modified form of significant digits.
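The three fields above can be pulled out of a stored value directly. A minimal sketch, assuming the 32 bit (single) layout of 1 sign bit, 8 exponent bits and 23 stored significand bits; the function names are illustrative only, and the rebuild step handles normal numbers only (not zero, denormalized values, infinity or NaN).

```python
import struct

def decode_binary32(x: float):
    """Round x to binary32 and split the 32 stored bits into three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit: 0 = positive, 1 = negative
    biased_exp = (bits >> 23) & 0xFF  # 8 bits: true exponent + 127
    fraction = bits & 0x7FFFFF        # 23 bits: significand without hidden bit
    return sign, biased_exp, fraction

def value_of(sign, biased_exp, fraction):
    """Rebuild a normal number: (-1)^sign * 1.fraction * 2^(biased_exp - 127)."""
    significand = 1 + fraction / 2**23  # put the hidden leading 1 back
    return (-1) ** sign * significand * 2.0 ** (biased_exp - 127)

fields = decode_binary32(-6.25)   # -6.25 == -1.5625 * 2^2
print(fields)                     # (1, 129, 4718592)
print(value_of(*fields))          # -6.25
```

The stored exponent 129 is the true exponent 2 plus the bias 127.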
Floating point limitations
IEEE 754 standard (1985) encoding.
IEEE 854-1987 (radix-independent), IEEE 754-2008, ISO/IEC/IEEE 60559:2011 - updates.
ISO/IEC/IEEE 60559:2011 adopts the IEEE 754-2008 text as an international standard.
5 basic formats (marked * below) - the formats the standard defines arithmetic for; 32 and 64 bit binary are supported almost everywhere.
7 interchange formats - used for sharing data between systems.
Sometimes also used for special purpose storage.
e.g. binary16 (half precision) - used for float math in video cards.
Binary - uses hidden bit convention and biased exponent.
16 bit (half) (sign, 5 exponent, 10 significand +1)
Primarily used by video co-processors for internal use.
32 bit (single) (sign, 8 exponent, 23 significand +1) *
64 bit (double) (sign, 11 exponent, 52 significand +1) *
128 bit (quad) (sign, 15 exponent, 112 significand +1) *
Early (1980-2008) systems often had custom-sized, software-implemented
quad precision floats.
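The field widths listed above determine the bias, the precision and the range of each binary format. A rough sketch of that arithmetic, using exact fractions so the quad values do not overflow; the table and function names are illustrative, not part of the standard.

```python
import math
import sys
from fractions import Fraction

# (exponent bits, stored significand bits) as listed in the notes above.
FORMATS = {
    "binary16": (5, 10),
    "binary32": (8, 23),
    "binary64": (11, 52),
    "binary128": (15, 112),
}

def describe(exp_bits, frac_bits):
    """Return (bias, precision in bits, largest finite value, smallest normal)."""
    bias = 2 ** (exp_bits - 1) - 1
    precision = frac_bits + 1                      # +1 for the hidden bit
    max_finite = (2 - Fraction(1, 2 ** frac_bits)) * 2 ** bias
    min_normal = Fraction(1, 2 ** (bias - 1))      # 2^(1 - bias)
    return bias, precision, max_finite, min_normal

for name, (e, f) in FORMATS.items():
    bias, precision, max_finite, _ = describe(e, f)
    digits = precision * math.log10(2)             # approximate decimal digits
    print(f"{name}: bias={bias} precision={precision} bits (~{digits:.1f} digits)")

print(describe(5, 10)[2])                          # 65504
print(describe(11, 52)[2] == sys.float_info.max)   # True
```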
Decimal 32, 64*, 128* - used where precise decimal rounding required
such as financial and tax software.
Storage structure different and more variable.
All significant digits represented.
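Why decimal storage matters for money can be seen with a short comparison. This is only a sketch: Python's decimal module is a software decimal arithmetic and is assumed here as a stand-in; it is not the decimal64 or decimal128 storage layout itself.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floats cannot hold 0.1, 0.2 or 0.3 exactly, so the sum drifts.
print(0.1 + 0.2 == 0.3)                                        # False

# Decimal values keep every decimal digit, so the sum is exact.
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))    # True

# Rounding to cents: the decimal value rounds as an accountant expects,
# while the binary value closest to 2.675 is slightly below it.
print(Decimal("2.675").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))  # 2.68
print(round(2.675, 2))                                         # 2.67
```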
Standard also covers :
Extendable or extended formats - formats other than the basic formats
that follow the same rules for how they are interpreted.
Systems and applications are not required to support extended formats.
An extended format may use additional significant digits as long
as it uses the exponent size of the next larger basic format.
binary32 1 sign bit, 24 significant bits, 8 exponent bits.
binary64 1 sign bit, 53 significant bits, 11 exponent bits.
56 bit float
1 sign bit, 45 significant bits, 11 exponent bits.
# hidden bit included in significant bit counts.
Interchange formats - efficient ways of compacting and moving floats
between systems, including
converting extended formats to standard with minimal distortion
converting decimal values to binary
data compression
Rounding rules - for addition, multiplication, trig, etc.
Arithmetic operations - standardized algorithm for standard operations.
Exception handling such as :
Divide by zero - return representation of infinity rather than crash.
Inexact - rounding rules for numbers that can't be stored perfectly.
Invalid operation - such as square root of negative number returns
a NaN value with flag info.
Overflow - result too large, returns infinity.
Underflow - denormalized result if not totally out of range.
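The five exceptions above can be watched as status flags. A sketch under stated assumptions: Python's built-in float does not expose these flags (and raises its own errors for 1.0/0.0 and math.sqrt(-1.0)), so the decimal module is used instead, with a hypothetical small format of 7 digits and exponent range -95 to 96 and all traps switched off.

```python
from decimal import (Context, Decimal, DivisionByZero, Inexact,
                     InvalidOperation, Overflow, Underflow)

SIGNALS = (DivisionByZero, Inexact, InvalidOperation, Overflow, Underflow)

# Hypothetical format for illustration: no traps, so each exception
# returns a default result and sets a flag instead of raising an error.
ctx = Context(prec=7, Emin=-95, Emax=96, traps=[])

def run(operation, *args):
    """Run one context operation; return its result and the flags it set."""
    ctx.clear_flags()
    result = operation(*args)
    raised = [sig.__name__ for sig in SIGNALS if ctx.flags[sig]]
    return result, raised

print(run(ctx.divide, Decimal(1), Decimal(0)))        # infinity, divide by zero
print(run(ctx.divide, Decimal(1), Decimal(3)))        # rounded result, inexact
print(run(ctx.sqrt, Decimal(-1)))                     # NaN, invalid operation
print(run(ctx.multiply, Decimal("9E96"), Decimal(10)))            # overflow
print(run(ctx.divide, Decimal("1.234567E-95"), Decimal(1000)))    # underflow
```

Overflow and underflow also set the inexact flag, because the returned value is not the exact result.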
Reproducible results - operations that are commutative and associative
in real-number math should give the same results in any order, or at least
reproducible, predictable variations.
Real world, order of actions usually has minimal effect.
Computers, same actions in different order can affect results greatly.
e.g.
Linux dc command set to 4 decimal digit precision.
4 k
1 3 / 1 3 / 1 3 / + + p
.9999
1 1 1 + + 3 / p
1.0000
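The same ordering effect can be reproduced outside dc. A minimal sketch, assuming each division is cut off (truncated) at 4 decimal places as in the dc session; the helper name div4 is made up for this example. The last line shows the effect with ordinary 64 bit binary floats.

```python
from decimal import Decimal, ROUND_DOWN

STEP = Decimal("0.0001")  # 4 digits after the decimal point

def div4(a, b):
    """Divide, then truncate the quotient to 4 decimal places."""
    return (Decimal(a) / Decimal(b)).quantize(STEP, rounding=ROUND_DOWN)

# Divide first, then add: each 1/3 loses its tail before the sum.
print(div4(1, 3) + div4(1, 3) + div4(1, 3))   # 0.9999

# Add first, then divide: nothing is lost.
print(div4(1 + 1 + 1, 3))                     # 1.0000

# Binary floats: regrouping the same three additions changes the result.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```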
Number line representing float limitations.
Issues :
Precision vs. storage size and standardization.
Predefined sizes - single, double, etc.
Values must be expressed as sums of powers of 2.
Cannot be stored exactly: 1/3, Pi, and any number requiring more significant digits than are available.
Rules for rounding.
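Whether a fraction can be written as a finite sum of powers of 2 depends only on its denominator. A small sketch of that check; is_dyadic is an illustrative name, and the last lines show the value a 64 bit float actually stores for 0.1.

```python
from fractions import Fraction

def is_dyadic(q: Fraction) -> bool:
    """True if the reduced denominator is a power of 2 (finite binary expansion)."""
    d = q.denominator
    return d & (d - 1) == 0

print(is_dyadic(Fraction(5, 8)))    # True:  5/8 == 1/2 + 1/8
print(is_dyadic(Fraction(1, 3)))    # False: repeats forever in binary
print(is_dyadic(Fraction(1, 10)))   # False: 0.1 also repeats in binary

# The float nearest to 0.1 is a power-of-2 fraction close to, but not equal to, 1/10.
print(Fraction(0.1))                      # 3602879701896397/36028797018963968
print(Fraction(0.1) == Fraction(1, 10))   # False
```

A value that passes this check can still be rounded if it needs more significant bits than the format provides.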
Numbers, both +/-, too large in magnitude for given storage.
Known as overflow.
Numbers, both +/-, too small in magnitude to represent in standard
format. Known as underflow.
Use of scientific notation allows a minor 'cheat'.
Given a representation with
1 integer digit,
4 fractional significant digits,
1 digit exponent.
9.4213 x 10^-9 / 10 == 9.4213 x 10^-10
but we are only allowed a single exponent digit.
9.4213 x 10^-9 / 10 == 0.9421 x 10^-9
We now have a usable value with slightly less precision,
but it is no longer in proper scientific notation.
This is referred to as denormalized.
# Some systems/software using IEEE 754 will not accept denormalized values
and will replace them with zero or indicate an error instead.
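The toy format above can be modelled to watch the denormalized result appear. A sketch, assuming Python's decimal module with 5 significant digits and a single-digit exponent range of -9 to 9 as a stand-in for the example format; the second half checks the same behaviour in 64 bit binary floats, which assumes the platform has not enabled a flush-to-zero mode.

```python
import sys
from decimal import Context, Decimal, Subnormal

# Stand-in for the toy format: 1 integer digit + 4 fractional digits,
# exponent limited to one digit, no traps so results are returned quietly.
toy = Context(prec=5, Emin=-9, Emax=9, traps=[])

result = toy.divide(Decimal("9.4213E-9"), Decimal(10))
print(result)                 # 9.421E-10, i.e. 0.9421 x 10^-9: one digit lost
print(toy.flags[Subnormal])   # True: the result is denormalized

# Binary64: values below the smallest normal number shrink gradually to zero.
smallest_normal = sys.float_info.min
print(smallest_normal / 2 > 0.0)   # True: still representable, with fewer bits
print(5e-324 / 2)                  # 0.0: below the smallest denormalized value
```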
Representation of zero in chosen format.
Scientific notation issue.
Zero is not a valid value in standard (normalized) scientific notation.
So, by special rule, if both the exponent and significand are
set to zero, the value is zero. The sign bit is still stored, so +0 and -0
both exist, but they compare as equal.
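The special rule for zero can be checked by looking at the stored bits. A minimal sketch for 64 bit (double) values; bits64 is an illustrative helper name.

```python
import math
import struct

def bits64(x: float) -> str:
    """Return the 64 stored bits of x as a hex string."""
    (b,) = struct.unpack(">Q", struct.pack(">d", x))
    return f"{b:#018x}"

print(bits64(0.0))    # 0x0000000000000000: exponent and significand all zero
print(bits64(-0.0))   # 0x8000000000000000: same, with only the sign bit set

print(0.0 == -0.0)                 # True: the two zeros compare as equal
print(math.copysign(1.0, -0.0))    # -1.0: the sign bit is still there
```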