In ordinary mathematics, we sometimes express numbers in "scientific notation," such as:
0.632 * 10^4
or
-0.781 * 10^-3
In computing, we do something very much like this but in base 16:
0.7A0 * 16^3
or
-0.4BF * 16^-2
that is, a number is represented as fraction * 16^exponent, where the fraction is between 0.0 and 1.0 or between -1.0 and 0.0.
The same can be done in base 2, and in fact that is what is going on inside the CPU. We shall stick to base 16 for now.
In assembly language, there are at least two standard formats for floating-point numbers: short and long.
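Short floating-point (32 bits): 1 sign bit, a 7-bit exponent stored in excess-64 form, and a 24-bit fraction (6 hexadecimal digits).
Long floating-point (64 bits): 1 sign bit, a 7-bit exponent stored in excess-64 form, and a 56-bit fraction (14 hexadecimal digits).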
Examples
Sign bit: 1
Exponent: 67 - 64 = 3 (in base 10)
Fraction: 478100 (in base 16)
Value: -0.478100 * 16^3
Sign bit: 0
Exponent: 67 - 64 = 3 (in base 10)
Fraction: 478100 (in base 16)
Value: 0.478100 * 16^3
Sign bit: 1
Exponent: 5 - 64 = -59 (in base 10)
Fraction: 130101 (in base 16)
Value: -0.130101 * 16^-59 (the 59 is in base 10)
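The field-by-field reading used in these examples can be sketched in Python. This is only an illustration, not assembler code: the function names decode_short and short_value are made up for this sketch, and the three 32-bit words are assumed values built from the sign, exponent and fraction listed above.

```python
def decode_short(word: int):
    """Split a 32-bit short floating-point word into its three fields."""
    sign = (word >> 31) & 0x1              # leftmost bit
    exponent = ((word >> 24) & 0x7F) - 64  # next 7 bits, stored in excess-64 form
    fraction = word & 0xFFFFFF             # last 24 bits = 6 hex digits
    return sign, exponent, fraction


def short_value(word: int) -> float:
    """Value of the word: (+/-) 0.fraction (base 16) * 16^exponent."""
    sign, exponent, fraction = decode_short(word)
    magnitude = (fraction / 16**6) * 16.0**exponent
    return -magnitude if sign else magnitude


# Assumed words, assembled from the fields in the three examples above.
for word in (0xC3478100, 0x43478100, 0x85130101):
    sign, exponent, fraction = decode_short(word)
    print(f"{word:08X}: sign={sign} exponent={exponent} fraction={fraction:06X}")
```

Calling short_value on one of these words carries the conversion through to base 10.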
Converting these entirely to base 10 would take more work.
In assembly language, we can declare variables of this type:
FNUM1    DC    E'6'              Result: 41600000 (short format)
FNUM2    DC    D'3.1416'         Result: 413243FE5C91D14E (long format)
FNUM3    DC    E'-1234.567E5'    Result: C775BCCC (short format)
Here in FNUM3, the E5 is an exponent in base 10:
-1234.567 * 10^5
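One way to check such constants by hand is sketched below in Python. The function encode_hfp is a made-up name for this illustration, not something the assembler provides. It assumes the fraction is rounded to the nearest value that fits (Python's round, which breaks exact ties toward the even digit), and it does not check that the exponent fits in 7 bits.

```python
from fractions import Fraction


def encode_hfp(value: Fraction, hex_digits: int = 6) -> str:
    """Hex digits of a base-16 floating-point constant.

    hex_digits=6 gives the short format, hex_digits=14 the long format.
    Assumption: the fraction is rounded to nearest; exponent range is not checked.
    """
    if value == 0:
        return "0" * (2 + hex_digits)
    sign = 1 if value < 0 else 0
    magnitude = abs(value)
    exponent = 0
    # Scale by powers of 16 until 1/16 <= magnitude < 1 (a normalized fraction).
    while magnitude >= 1:
        magnitude /= 16
        exponent += 1
    while magnitude < Fraction(1, 16):
        magnitude *= 16
        exponent -= 1
    fraction = round(magnitude * 16**hex_digits)
    if fraction == 16**hex_digits:  # rounding carried into a new leading digit
        fraction //= 16
        exponent += 1
    first_byte = (sign << 7) | (exponent + 64)  # sign bit, then excess-64 exponent
    return f"{first_byte:02X}{fraction:0{hex_digits}X}"


print(encode_hfp(Fraction(6)))                      # E'6'
print(encode_hfp(Fraction("3.1416"), 14))           # D'3.1416'
print(encode_hfp(Fraction("-1234.567") * 10**5))    # E'-1234.567E5'
```

Exact fractions are used instead of Python floats so that the binary rounding of the host machine does not affect the hex digits.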
There could be multiple ways to represent a number using different exponents. In base 10, for instance:
4 = 0.4 * 10^1 = 0.04 * 10^2
To avoid confusion, a floating-point number is called "normalized" if the first digit in its fraction is not 0. (If it is 0, move the decimal point over and adjust the exponent.) If the fraction is exactly 0, this cannot be done.
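Normalization in base 16 can be sketched as follows. This is a Python illustration under assumptions, not the hardware's procedure: normalize is a made-up name, and the fraction is held as an integer of hex_digits hexadecimal digits.

```python
def normalize(fraction: int, exponent: int, hex_digits: int = 6):
    """Shift the fraction left one hex digit at a time until its first digit is not 0.

    Each shift multiplies the fraction by 16, so the exponent drops by 1
    to keep the value unchanged. A zero fraction is returned as-is.
    """
    if fraction == 0:
        return fraction, exponent  # cannot be normalized
    smallest_normalized = 16 ** (hex_digits - 1)  # 100000 (base 16) for 6 digits
    while fraction < smallest_normalized:
        fraction *= 16
        exponent -= 1
    return fraction, exponent


# 0.004781 * 16^5 is the same value as 0.478100 * 16^3.
fraction, exponent = normalize(0x004781, 5)
print(f"0.{fraction:06X} * 16^{exponent}")
```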
We have 4 floating-point registers, each 64 bits long, numbered 0, 2, 4 and 6. If we are using the short format, only the first 32 bits of each FP register are used.
We can work with short floating-point values using operations such as:
LE     Load Short                     LE   R,D(X,B)
LER    Load Short Register            LER  R1,R2
STE    Store Short                    STE  R,D(X,B)
CE     Compare Short                  CE   R,D(X,B)
CER    Compare Short Register         CER  R1,R2
LTER   Load and Test Register Short   LTER R1,R2   (sets the Condition Code)
AE     Add Short                      AE   R,D(X,B)
AER    Add Short Register             AER  R1,R2
and likewise we have a set of instructions for long floating-point values.
The results of arithmetic are normalized afterward (if possible).
How can we convert a FP number to a decimal number?
Example
Start with 0.4ABC * 16^0.
Rewrite it as an integer M times a power of 16: this is 4ABC * 16^-4. Thus M = 4ABC and N = -4.
Convert M (base 16) to 19132 (base 10).
Divide 19132 by 16^4 to get 0.2919 (base 10).
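The same steps can be followed in a few lines of Python. This is a sketch only: to_decimal is a made-up name, and it takes the fraction digits as a string and the exponent as a base-10 integer.

```python
def to_decimal(fraction_digits: str, exponent: int):
    """Convert 0.<fraction_digits> (base 16) * 16^exponent to base 10.

    Returns (M, N, value), where the number equals M * 16^N and M is an integer.
    """
    m = int(fraction_digits, 16)         # M: the fraction digits read as an integer
    n = exponent - len(fraction_digits)  # N: one less power of 16 per digit moved
    value = m * 16.0**n
    return m, n, value


m, n, value = to_decimal("4ABC", 0)
print(m, n, value)
```

For 0.4ABC * 16^0 this gives M = 19132 and N = -4, and the value rounds to 0.2919.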
What can go wrong with FP operations?
We could have any of these (at least):