Notes about Floating-Point Numbers in Assembly Language

In ordinary mathematics, we sometimes express numbers in "scientific notation" such as:

          0.632 * 10^4

or

         -0.781 * 10^-3

In computing, we do something very much like this but in base 16:

          0.7A0 * 16^3

or

         -0.4BF * 16^-2

that is, a number is represented as fraction * 16^exponent, where the fraction is between 0.0 and 1.0 or between -1.0 and 0.0.

The same can be done in base 2, and in fact that is what is going on inside the CPU. We shall stick to base 16 for now.
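As a quick check of this notation, here is a minimal Python sketch that evaluates fraction * 16^exponent exactly. Python is used only for illustration; the function name hex_sci and its arguments are made up for this example and are not part of the assembler.

```python
from fractions import Fraction

def hex_sci(fraction_digits, exponent, negative=False):
    """Evaluate 0.<fraction_digits> * 16^exponent exactly (digits are base 16)."""
    m = int(fraction_digits, 16)                       # digits read as a hex integer
    value = Fraction(m, 16 ** len(fraction_digits))    # put the radix point in front
    value *= Fraction(16) ** exponent                  # apply the base-16 exponent
    return -value if negative else value

print(hex_sci("7A0", 3))                          # 0.7A0 * 16^3
print(float(hex_sci("4BF", -2, negative=True)))   # -0.4BF * 16^-2
```

The first value comes out as a whole number because multiplying by 16^3 shifts all three fraction digits to the left of the radix point.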

In assembly language, there are at least two standard formats for floating-point numbers: short and long.

Short floating-point (32 bits):

Long floating-point (64 bits):

Examples

Converting these entirely to base 10 would be work.

In assembly language, we can declare variables of this type:

FNUM1   DC   E'6'             Result:  41600000 (short format)

FNUM2   DC   D'3.1416'        Result:  413243FE5C91D14E

FNUM3   DC   E'-1234.567E5'   Result:  C775BCCC

Here in FNUM3, the E5 is an exponent in base 10:

     -1234.567 * 10^5
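To see where these hex results come from, the following Python sketch decodes a stored value back into a number. It assumes the usual IBM hexadecimal floating-point layout, which is not spelled out in these notes: the first bit is the sign, the next 7 bits are the exponent plus 64, and the remaining bits are the fraction. The name decode_hfp is hypothetical.

```python
def decode_hfp(word, bits=32):
    """Decode a short (32-bit) or long (64-bit) hex floating-point value.

    Assumed layout: 1 sign bit, 7-bit exponent stored as exponent + 64,
    then the fraction (6 hex digits for short, 14 for long).
    """
    frac_bits = bits - 8
    sign = -1 if (word >> (bits - 1)) & 1 else 1
    exponent = ((word >> frac_bits) & 0x7F) - 64
    fraction = word & ((1 << frac_bits) - 1)
    frac_digits = frac_bits // 4
    # value = 0.fraction * 16^exponent
    return sign * fraction * 16.0 ** (exponent - frac_digits)

print(decode_hfp(0x41600000))              # FNUM1
print(decode_hfp(0x413243FE5C91D14E, 64))  # FNUM2
print(decode_hfp(0xC775BCCC))              # FNUM3
```

Under these assumptions, FNUM3 decodes to a value slightly different from -123456700, because the short format keeps only 6 hex digits of fraction.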

There could be multiple ways to represent a number using different exponents. In base 10, for instance:

      4 = 0.4 * 10^1 = 0.04 * 10^2

To avoid confusion, a floating-point number is called "normalized" if the first digit in its fraction is not 0. (If it is 0, move the decimal point over and adjust the exponent.) If the fraction is exactly 0, this cannot be done.
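Normalization in base 16 can be sketched as follows. This is an illustration only, assuming a 6-digit hex fraction as in the short format; the name normalize is hypothetical, and the sketch ignores exponent underflow.

```python
def normalize(fraction, exponent, digits=6):
    """Shift a hex fraction left until its first digit is nonzero.

    fraction is an integer holding `digits` hex digits; the value
    represented is 0.fraction * 16^exponent. A zero fraction is
    returned unchanged, since it cannot be normalized.
    """
    top_shift = 4 * (digits - 1)            # bit position of the first hex digit
    while fraction != 0 and (fraction >> top_shift) == 0:
        fraction <<= 4                      # move the radix point one hex digit
        exponent -= 1                       # and compensate in the exponent
    return fraction, exponent

f, e = normalize(0x000600, 3)
print(format(f, "06X"), e)                  # 0.000600 * 16^3 becomes 0.600000 * 16^0
```

Each left shift by one hex digit multiplies the fraction by 16, so the exponent drops by 1 to keep the value the same.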

We have 4 floating-point registers, each 64 bits long, numbered 0, 2, 4 and 6. If we are using the short format, only the first 32 bits of each FP register are used.

We can work with short floating-point values using operations such as:

LE   Load Short

     LE   R,D(X,B)

LER  Load Short Register

     LER  R1,R2 
          
STE  Store Short

     STE  R,D(X,B)

CE   Compare Short

     CE   R,D(X,B)

CER  Compare Short Register

     CER  R1,R2

LTER Load and Test Register Short

     LTER R1,R2  (sets the Condition Code)

AE   Add Short

     AE   R,D(X,B)

AER  Add Short Register

     AER  R1,R2

and likewise we have a set of instructions for long floating-point values.

The results of arithmetic are normalized afterward (if possible).


How can we convert a FP number to a decimal number?

  1. Express the FP number as: M * 16^N (M and N are integers).

  2. Convert the M to a decimal integer.

  3. If N > 0, multiply by 16^N; else divide by 16^-N.

Example

Start with 0.4ABC * 16^0.

This is 4ABC * 16^-4. Thus M = 4ABC and N = -4.

Convert M (base 16) to 19132 (base 10).

Divide 19132 by 16^4 (= 65536) to get approximately 0.2919 (base 10).
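The three steps can be sketched in Python as a check on the arithmetic. The function name fp_to_decimal and its argument names are made up for this illustration; the fraction is passed as a string of hex digits.

```python
def fp_to_decimal(fraction_digits, exponent):
    """Convert 0.<fraction_digits> * 16^exponent (hex digits) to decimal.

    Returns (M, N, value), following the three steps above.
    """
    # Step 1: rewrite as M * 16^N by moving the radix point past every digit.
    n = exponent - len(fraction_digits)
    # Step 2: convert M from base 16 to a decimal integer.
    m = int(fraction_digits, 16)
    # Step 3: multiply or divide by the appropriate power of 16.
    if n > 0:
        value = m * 16 ** n
    else:
        value = m / 16 ** (-n)
    return m, n, value

m, n, value = fp_to_decimal("4ABC", 0)
print(m, n, round(value, 4))   # 19132 -4 0.2919
```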


What can go wrong with FP operations?

We could have any of these (at least):