# Fixed and Floating Point Representation

In this tutorial, you will learn how numbers with fractional parts are represented in binary using fixed-point and floating-point representation. You will also learn what overflow is and how to detect it.

## Fixed-Point Representation

In fixed-point representation, the position of the radix point is fixed, which in turn fixes the number of bits dedicated to the integer part and to the fractional part. For example, in an 8-bit binary number, if we place the radix point so that 3 bits lie to its right, we get 5 bits for the integer part and 3 bits for the fractional part.
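The 5-bit integer, 3-bit fraction layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the helper names `to_fixed`, `from_fixed`, and `show` are made up for this example, and the sketch assumes an unsigned value that is stored as the integer `value × 2^3`.

```python
FRAC_BITS = 3   # bits to the right of the radix point
TOTAL_BITS = 8  # 5 integer bits + 3 fraction bits


def to_fixed(value):
    """Encode a non-negative value as an unsigned 8-bit fixed-point integer."""
    raw = round(value * (1 << FRAC_BITS))  # scale by 2^3, round to the nearest step
    if not 0 <= raw < (1 << TOTAL_BITS):
        raise OverflowError(f"{value} does not fit in {TOTAL_BITS} bits")
    return raw


def from_fixed(raw):
    """Decode the stored integer back into its real value."""
    return raw / (1 << FRAC_BITS)


def show(raw):
    """Format the stored bits with the radix point made visible."""
    bits = format(raw, f"0{TOTAL_BITS}b")
    return bits[:-FRAC_BITS] + "." + bits[-FRAC_BITS:]


raw = to_fixed(19.625)
print(show(raw))        # 10011.101
print(from_fixed(raw))  # 19.625
```

The radix point is never stored; it exists only in how the 8 bits are interpreted, which is why `show` has to insert it at a fixed position.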

Fixed-point representation can be either signed or unsigned, as shown in the figure.

As shown in the figure,

• Signed fixed-point representation: This representation has three parts: the sign, the integer part, and the fractional part. One bit is reserved for the sign, and the remaining bits are divided between the integer and fractional parts.
• Unsigned fixed-point representation: This representation has just two parts: the integer part and the fractional part.
• A is a number with a fixed radix point.

The disadvantages of using fixed-point representation are given below:

• It has only a limited range.
• It cannot represent very large or very small numbers, because both the largest value and the smallest step are set by the fixed bit allocation.
• Fractions whose binary expansion does not terminate cannot be represented exactly.
• It is inflexible: the split between integer and fractional bits cannot adapt to the magnitude of the number.
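To make these limits concrete, here is a small sketch that assumes the unsigned layout with 5 integer bits and 3 fraction bits from the earlier example. The names `step`, `largest`, and `quantize` are illustrative only.

```python
INT_BITS = 5
FRAC_BITS = 3

# Smallest representable increment: 2^-3
step = 2 ** -FRAC_BITS
# Largest representable value: all 8 bits set, i.e. 255 steps
largest = (2 ** (INT_BITS + FRAC_BITS) - 1) * step


def quantize(value):
    """Round value to the nearest representable multiple of step."""
    return round(value / step) * step


print(step, largest)   # 0.125 31.875
print(quantize(0.1))   # 0.125
print(quantize(0.05))  # 0.0
```

With this layout nothing above 31.875 can be stored, 0.1 is stored with a 25% error, and 0.05 collapses to zero.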

Floating-point representation addresses the limits on range and flexibility, although fractions with non-terminating binary expansions still have to be rounded.

## Floating-Point Representation

In floating-point representation, a number is stored as a sign, a mantissa, and an exponent {S, M, E}. The formats described here follow the IEEE 754 standard. Here is a figure which shows floating-point representation.

As shown in the figure,

• The number of bits for its integer or fractional part is not fixed as its radix point is not fixed.
• It fixes the number of bits for the mantissa, sign, and exponent.
• The mantissa part gives the fractional part and the exponent gives the weight of the number in powers of 2.
• The sign tells whether the number is positive or negative. If it is 0 the number is positive and if 1, it is negative.

Floating-point has two common precisions:

• Single precision: 32-bit representation with 1 sign bit, an 8-bit exponent, and a 23-bit mantissa
• Double precision: 64-bit representation with 1 sign bit, an 11-bit exponent, and a 52-bit mantissa
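The field widths listed above can be inspected with Python's standard `struct` module, which packs a float into its IEEE 754 bytes. The helper names `float32_fields` and `float64_fields` are assumptions made for this sketch.

```python
import struct


def float32_fields(value):
    """Return (sign, exponent, mantissa) bit strings of value as a 32-bit float."""
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]      # 1 + 8 + 23 bits


def float64_fields(value):
    """Return (sign, exponent, mantissa) bit strings of value as a 64-bit float."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", value))
    bits = format(raw, "064b")
    return bits[0], bits[1:12], bits[12:]    # 1 + 11 + 52 bits


sign, exponent, mantissa = float32_fields(-2.5)
print(sign, exponent, len(mantissa))  # 1 10000000 23

sign, exponent, mantissa = float64_fields(-2.5)
print(sign, exponent, len(mantissa))  # 1 10000000000 52
```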

## Steps to Represent a Number in Floating-Point

• Write the number in binary, keeping track of its sign.
• Move the radix point so that it sits just after the first 1 bit, and write the number as the shifted value multiplied by 2 raised to the number of places the radix point was moved. For example, 110.111 is written as $1.10111 \times 2^{2}$, as we shifted the radix point by two places.
• A significand whose leading bit is 1 is called a normalized significand. This leading bit is not stored in floating-point representation; it is always implied.
• Add a bias to the power of 2: 127 in the case of 32-bit precision and 1023 in the case of 64-bit precision.
• In 32-bit precision, the biased exponent must lie between 1 and 254, as the all-zeros and all-ones patterns are reserved.
• Represent the biased exponent in binary (8 bits in 32-bit precision).
• Store the bits after the radix point as the mantissa, padding with zeros on the right if there are fewer than 23 bits.
• Set the sign bit to 0 if the number is positive and 1 if it is negative.

For example, 1001101011.111 can be written as $1.001101011111 \times 2^{9}$, as we had to shift the radix point 9 places to the left. After adding the bias of 127, we get 9 + 127 = 136, which is 10001000 in binary.
Thus in floating-point, it becomes:

• Sign: 0
• Exponent: 10001000
• Mantissa: 001101011111, followed by zeros to fill 23 bits
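The steps above can be sketched as a small Python function. This is an illustration under stated assumptions rather than a full IEEE 754 encoder: `encode_float32` is a hypothetical name, the input must be nonzero, finite, and in the normalized range, and extra fraction bits are simply truncated instead of rounded. The example value 619.875 is 1001101011.111 in binary.

```python
import math


def encode_float32(value):
    """Return 'sign exponent mantissa' for a nonzero finite value."""
    if value == 0 or not math.isfinite(value):
        raise ValueError("this sketch handles only nonzero finite values")

    sign = "1" if value < 0 else "0"
    magnitude = abs(value)

    # Normalize: shift the radix point until the significand is in [1, 2).
    power = 0
    while magnitude >= 2:
        magnitude /= 2
        power += 1
    while magnitude < 1:
        magnitude *= 2
        power -= 1

    # Add the single-precision bias and check the allowed range 1..254.
    biased = power + 127
    if not 1 <= biased <= 254:
        raise OverflowError("exponent outside the normalized range")
    exponent = format(biased, "08b")

    # Drop the implied leading 1, then read off 23 fraction bits (truncating).
    fraction = magnitude - 1
    mantissa = ""
    for _ in range(23):
        fraction *= 2
        bit = int(fraction)
        mantissa += str(bit)
        fraction -= bit

    return f"{sign} {exponent} {mantissa}"


print(encode_float32(619.875))
# 0 10001000 00110101111100000000000
```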

## Significant Digits in Decimal

The number of significant digits depends on the number of bits used in the mantissa. In single precision, 24 bits are used for the mantissa, including the one implied bit. The bit just before the radix point is known as the implied bit and is one by default. So, we do not need to store this bit.

For example, in 1.11011… the leading 1 is the implied bit according to floating-point representation, and the 23 bits after the radix point are stored as the mantissa. So, for calculation purposes, we assume a 24-bit mantissa.

The number of significant decimal digits $x$ for 32-bit precision is given by $2^{24} = 10^{x}$.
Applying log on both sides, $24 \log_{10} 2 = x \log_{10} 10$, so $x = 24 \times 0.301$.
Therefore, $x$ comes out nearly equal to 7.22.

Thus, we have about 7 significant decimal digits in 32-bit precision. Similarly, in 64-bit precision, we have 53 bits for the mantissa, including the implied bit, and its number of significant decimal digits is given by
$2^{53} = 10^{x}$. Applying log on both sides, we have
$53 \log_{10} 2 = x \log_{10} 10$, so $x = 53 \times 0.301$.
Therefore, $x$ comes out nearly equal to 15.95.

Thus, 64-bit precision has about 16 significant decimal digits.
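The same calculation can be checked numerically. The function name `decimal_digits` is chosen only for this sketch.

```python
import math


def decimal_digits(significand_bits):
    """Solve 2**bits == 10**x for x, i.e. x = bits * log10(2)."""
    return significand_bits * math.log10(2)


print(round(decimal_digits(24), 2))  # 7.22
print(round(decimal_digits(53), 2))  # 15.95
```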

## Decimal Numbers in Floating-Point Representation

Here are two examples of decimal numbers represented in 32-bit floating-point representation.

1. 163.75
The binary form of the number is 10100011.11

So, on shifting the radix point to just after the first bit, we get $1.010001111 \times 2^{7}$.
We add 127 as a bias to 7, so we get 127 + 7 = 134. 134 can be represented in binary as 10000110.
The sign is zero as the number is positive.
So, its floating-point representation is:

• Sign: 0
• Exponent: 10000110
• Mantissa: 010001111, followed by zeros to fill 23 bits

2. -1272.125
The binary form of the magnitude is 10011111000.001

So, on shifting the radix point to just after the first bit, we get $1.0011111000001 \times 2^{10}$.
We add 127 as a bias to 10, so we get 127 + 10 = 137. 137 can be represented in binary as 10001001.
The sign is one as the number is negative.
So, its floating-point representation is:

• Sign: 1
• Exponent: 10001001
• Mantissa: 0011111000001, followed by zeros to fill 23 bits
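Both worked examples can be checked by decoding the fields back into a value. The sketch below assumes normalized single-precision fields given as bit strings, and treats any mantissa bits that are not written out as zeros. The name `decode_float32` is illustrative.

```python
def decode_float32(sign, exponent, mantissa):
    """Rebuild the value from sign, exponent, and mantissa bit strings."""
    power = int(exponent, 2) - 127                         # remove the bias
    # Restore the implied leading 1 in front of the stored fraction bits.
    significand = 1 + int(mantissa, 2) / 2 ** len(mantissa)
    value = significand * 2 ** power
    return -value if sign == "1" else value


print(decode_float32("0", "10000110", "010001111"))      # 163.75
print(decode_float32("1", "10001001", "0011111000001"))  # -1272.125
```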

## The Concept of Overflow

When we add two n-bit signed numbers, whether positive or negative, and the result falls outside the range that n bits can represent, an overflow is said to occur.

For example,

• If we add 7 + 5 as 4-bit signed numbers, 0111 + 0101 = 1100. In signed form, a 1 in the MSB means a negative number, so the result reads as -4 in signed 2’s complement.

This isn’t possible, as adding two positive numbers cannot give a negative number. Thus, an overflow happened. It can be resolved by using one more bit to represent the result: it becomes 01100, which is 12, the correct answer.
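The 7 + 5 example can be reproduced by masking a sum to a fixed width. This is a minimal sketch; `to_signed` and `add_wrapped` are names invented here, and Python's unbounded integers are masked to imitate a fixed-width register.

```python
BITS = 4


def to_signed(raw, bits=BITS):
    """Interpret the low `bits` bits of raw as a 2's complement number."""
    raw &= (1 << bits) - 1
    return raw - (1 << bits) if raw & (1 << (bits - 1)) else raw


def add_wrapped(a, b, bits=BITS):
    """Add a and b, keep only `bits` bits, and return (bit string, signed value)."""
    total = (a + b) & ((1 << bits) - 1)
    return format(total, f"0{bits}b"), to_signed(total, bits)


print(add_wrapped(7, 5))          # ('1100', -4)
print(add_wrapped(7, 5, bits=5))  # ('01100', 12)
```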

## Detection of Overflow

While performing signed addition in binary, an overflow can be detected using the following checks:

• If two positive numbers are added and the result is not positive, an overflow occurred.
• If two negative numbers are added and the result is not negative, an overflow occurred.
• If the carry into the MSB and the carry out of the MSB do not match, an overflow occurred.
• For example, on adding -6 and -3 in 4-bit signed 2’s complement, 1010 + 1101 = 0111 with a carry out of 1. The carry into the MSB is 0 and the carry out of the MSB is 1, so we have an overflow condition. Extending the result to 5 bits gives 10111, which in 2’s complement is -9.
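The carry comparison can be sketched as follows. The function name `add_with_overflow` is an assumption for this example, and the sketch expects both inputs to already fit in the chosen width.

```python
def add_with_overflow(a, b, bits=4):
    """Add two signed values in `bits`-bit 2's complement.

    Returns (result bit string, overflow flag).
    """
    mask = (1 << bits) - 1
    low_mask = (1 << (bits - 1)) - 1       # every bit below the MSB
    ua, ub = a & mask, b & mask            # 2's complement bit patterns

    # Carry into the MSB: carry produced by adding the lower bits only.
    carry_in = ((ua & low_mask) + (ub & low_mask)) >> (bits - 1)
    # Carry out of the MSB: the bit that no longer fits in `bits` bits.
    carry_out = (ua + ub) >> bits

    result = (ua + ub) & mask
    return format(result, f"0{bits}b"), carry_in != carry_out


print(add_with_overflow(-6, -3))  # ('0111', True)
print(add_with_overflow(7, 5))    # ('1100', True)
print(add_with_overflow(3, -1))   # ('0010', False)
```

The last call shows why the two carries are compared with each other: 3 + (-1) produces a carry out of the MSB, yet the result 2 is correct and no overflow is flagged.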

## Key Points to Remember

Here are the key points to remember in “Fixed and Floating-point Representation”.

• In fixed-point representation, the position of the radix point is fixed. Hence, we can represent only a limited range of numbers.
• In floating-point representation, we use a sign, an exponent, and a mantissa to represent numbers.
• Floating-point numbers use 32-bit and 64-bit precision, which take {23-bit mantissa, 8-bit exponent} and {52-bit mantissa, 11-bit exponent} respectively.
• The sign bit is 0 for positive numbers and 1 for negative numbers.
• The mantissa is the part of the number after the radix point, once the radix point has been shifted to just after the leading 1.
• The exponent is stored as the power of 2 plus the bias for that precision, written in binary.
• The 32-bit and 64-bit formats use 127 ($2^{8-1} - 1$) and 1023 ($2^{11-1} - 1$) respectively as their bias values.
• Overflow can occur when we add two signed numbers of the same sign in binary. It is detected when the carry into the MSB does not match the carry out of the MSB.
• In the case of an overflow, we express the result in n+1 bits and read it as a 2’s complement number to get the correct result.
