Wikipedia: Floating point

Definition

A floating point number is a digital representation of a real number in which the number of significant digits, rather than the absolute precision, is constant. In other words, a non-zero number a is represented by two numbers c and d such that a = c × b^d, where c (called the mantissa) lies in the range 1/b ≤ |c| < 1, b is a fixed system constant (called the base of numeration), and d is called the exponent. This allows a much greater range of numbers to be represented within a field of limited size than is possible in fixed point notation.
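For base b = 2, the decomposition a = c × b^d can be tried out directly. The sketch below relies on Python's standard `math.frexp`, which returns a mantissa of magnitude at least 1/2 and less than 1; the wrapper name `split_base2` is only an illustrative choice.

```python
import math

def split_base2(a):
    """Return (c, d) with a == c * 2**d and 0.5 <= |c| < 1 (a must be non-zero)."""
    if a == 0:
        raise ValueError("zero has no normalized mantissa")
    return math.frexp(a)

print(split_base2(22.0))     # (0.6875, 5), since 22 = 0.6875 * 2**5
print(split_base2(-0.375))   # (-0.75, -1), since -0.375 = -0.75 * 2**-1
```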

As an example, a floating point number with four decimal digits of precision could be used to represent 4321 or 0.00004321, but would round 432.123 to 432.1 and 43212.3 to 43210. In practice, the number of digits is usually larger than four.
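The four-digit rounding above can be reproduced with a small helper. This is only a sketch: `round_sig` is a hypothetical name, and it works by formatting the value with a fixed number of significant digits and parsing the result back.

```python
def round_sig(x, digits=4):
    """Round x to the given number of significant decimal digits."""
    return float(f"{x:.{digits}g}")

for x in (4321.0, 0.00004321, 432.123, 43212.3):
    print(x, "->", round_sig(x))
# 432.123 becomes 432.1 and 43212.3 becomes 43210.0;
# 4321.0 and 0.00004321 already fit in four digits and are unchanged.
```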

Representation

Binary floating point representation in a computer is analogous to scientific notation for decimal numbers. For example, the number 0.00001234 is written as 1.234 × 10⁻⁵ in decimal scientific notation, and the number 123400 is written as 1.234 × 10⁵. Notice that the significant digits are normalized to start with a non-zero digit.
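Most languages can print numbers in this normalized decimal form. As a small illustration, Python's `e` format specifier does so; the helper name `sci` is just a convenience for this sketch.

```python
def sci(x, digits=3):
    """Format x in decimal scientific notation with `digits` digits after the point."""
    return f"{x:.{digits}e}"

print(sci(0.00001234))   # 1.234e-05
print(sci(123400.0))     # 1.234e+05
```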

In floating point representation, the number is divided into three parts: a sign bit to indicate negative numbers, a certain number of bits allocated to the mantissa, and the remaining bits allocated to the exponent. In the simplified scheme described here, the exponent is stored as a binary integer in 2's complement notation (IEEE 754, by contrast, stores the exponent with a fixed bias). The radix point and the base of 2 are understood and not included in the representation.

As in scientific notation, a floating point number is normalized, here so that the first digit after the radix point is always non-zero. For example, 22₁₀ = 10110₂. Represented in floating point, this becomes 0.1011₂ × 2⁵. Since the radix point and the base are understood, the number is stored as 0 in the sign bit, 1011₂ in the mantissa and 5₁₀ = 101₂ in the exponent, that is, as 0101100......00101, where the ... stands for zeros that fill the fields out to the precision of the computer number system. The leading bits are the sign bit and the mantissa; the trailing bits are the exponent.
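The encoding of 22 can be sketched in a few lines. The field widths below (8 mantissa bits, 8 exponent bits) and the function name `encode` are assumptions chosen for illustration; the text does not fix them, and this toy layout is not the IEEE 754 format.

```python
import math

def encode(x, mant_bits=8, exp_bits=8):
    """Encode non-zero x as (sign, mantissa, exponent) bit strings in a toy layout."""
    if x == 0:
        raise ValueError("zero cannot be normalized")
    c, d = math.frexp(abs(x))                  # abs(x) == c * 2**d, 0.5 <= c < 1
    if not -(1 << (exp_bits - 1)) <= d < (1 << (exp_bits - 1)):
        raise OverflowError("exponent does not fit in the exponent field")
    mant = int(c * (1 << mant_bits))           # keep the leading bits, truncating the rest
    exp = d & ((1 << exp_bits) - 1)            # 2's complement of d in exp_bits bits
    sign = "1" if x < 0 else "0"
    return sign, format(mant, f"0{mant_bits}b"), format(exp, f"0{exp_bits}b")

print(encode(22))   # ('0', '10110000', '00000101')
```

With these assumed widths the sign bit is 0, the mantissa is 1011 padded with zeros, and the exponent is 5 written in binary, matching the worked example.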

Usage in computing

While the examples in the definition above use the decimal system (that is, the base of numeration is b = 10), computers usually use the binary system, which means that b = 2. In computers, floating point numbers are sized by the number of bits used to store them. This size is usually 32 bits or 64 bits, often called "single precision" and "double precision" respectively. A few machines offer larger sizes: Intel FPUs such as the 8087 (and its descendants integrated into the x86 architecture) offer 80-bit floating point numbers for intermediate results, and several systems offer 128-bit floating point, generally implemented in software.
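The two common sizes, and the precision lost in the smaller one, can be observed with Python's standard `struct` module. This sketch assumes the usual situation in which a Python float is a 64-bit double; `to_single` is a hypothetical helper name.

```python
import struct

single_bits = struct.calcsize("<f") * 8   # size of a single precision value in bits
double_bits = struct.calcsize("<d") * 8   # size of a double precision value in bits

def to_single(x):
    """Round a double to the nearest single precision value and return it as a float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(single_bits, double_bits)   # 32 64
print(to_single(0.5) == 0.5)      # True: 0.5 is exactly representable in both sizes
print(to_single(0.1) == 0.1)      # False: 0.1 is rounded differently in 32 bits
```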

The IEEE has standardized the computer representation in IEEE 754. This standard is followed by almost all modern machines. Notable exceptions are IBM mainframes, which recently acquired an IEEE mode, and Cray vector machines, where the T90 series had an IEEE version but the SV1 still uses the Cray floating point format.

Examples

The value of pi is π = 3.1415926...₁₀, which is equivalent to binary 11.001001000011111...₂. On a computer that allocates 17 bits to the mantissa, it becomes 0.11001001000011111₂ × 2². Hence the floating point representation starts with the bits 011001001000011111 (the sign bit followed by the 17 mantissa bits) and ends with the bits 10 (which represent the exponent 2 in the binary system). Note that the first zero indicates a positive number, and that the ending 10₂ = 2₁₀.
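The mantissa bits of π can be checked mechanically. The helper `mantissa_bits` below is a hypothetical name; it truncates rather than rounds, which matches the way the binary expansion is cut off in the example.

```python
import math

def mantissa_bits(x, n):
    """Return the first n bits of the normalized mantissa of x > 0, and the exponent."""
    c, d = math.frexp(x)                          # x == c * 2**d with 0.5 <= c < 1
    bits = format(int(c * (1 << n)), f"0{n}b")    # shift n bits left, drop the remainder
    return bits, d

print(mantissa_bits(math.pi, 17))   # ('11001001000011111', 2)
```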

The value -0.375₁₀ = -0.011₂, or -0.11₂ × 2⁻¹. In 2's complement notation, -1 is represented as 11111111 (assuming 8 bits are used for the exponent). In floating point notation, the number starts with a 1 for the sign bit, followed by the mantissa 110000... and then the exponent 11111111 at the end, giving 1110...011111111 (where the ... are zeros).
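Going the other way, the bit fields can be decoded back into a value. The sketch below assumes the simplified layout used in these examples (sign bit, mantissa read as the bits after the radix point, 2's complement exponent); the 8-bit field widths in the calls and the name `decode` are illustrative assumptions.

```python
def decode(sign, mantissa, exponent):
    """Decode toy-format bit strings: value = (-1)**sign * 0.mantissa * 2**exponent."""
    c = int(mantissa, 2) / (1 << len(mantissa))   # mantissa bits follow the radix point
    d = int(exponent, 2)
    if exponent[0] == "1":                        # leading 1 means negative in 2's complement
        d -= 1 << len(exponent)
    value = c * 2.0 ** d
    return -value if sign == "1" else value

print(decode("1", "11000000", "11111111"))   # -0.375
print(decode("0", "10110000", "00000101"))   # 22.0
```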