Floating Point

This guide aims to provide short and simple answers to the recurring questions novice programmers ask about floating-point numbers not 'adding up' correctly.

The freed bit patterns allow three special classes of numbers: ±0 (exponent and mantissa bits all zero), ±infinity (exponent all ones, mantissa zero), and denormalized numbers, which are lower-precision numbers obtained by removing the requirement that the leading digit in the mantissa be a one. Besides these three special classes, there are bit patterns that are not assigned a meaning, commonly referred to as NaNs (Not A Number). The IEEE standard for double precision simply adds more bits to the single precision format. Of the 64 bits used to store a double precision number, bits 0 through 51 are the mantissa, bits 52 through 62 are the exponent, and bit 63 is the sign bit. As before, the mantissa is between one and just under two.

The 11 exponent bits form a number between 0 and 2047, with an offset of 1023, allowing exponents from 2^-1022 to 2^+1023 for normalized numbers. The largest and smallest normalized numbers allowed are approximately ±1.8 × 10^308 and ±2.2 × 10^-308, respectively.
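To make this bit layout concrete, here is a minimal C sketch (the value -6.5 and the variable names are illustrative choices, not taken from the text) that extracts the sign, biased exponent, and mantissa fields of a double:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    double v = -6.5;                 /* example value */
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);  /* reinterpret the 64 bits of the double */

    uint64_t mantissa = bits & ((1ULL << 52) - 1);   /* bits 0..51  */
    unsigned exponent = (bits >> 52) & 0x7FF;        /* bits 52..62 */
    unsigned sign     = (unsigned)(bits >> 63);      /* bit 63      */

    /* value = (-1)^sign * (1 + mantissa / 2^52) * 2^(exponent - 1023) */
    printf("sign=%u  biased exponent=%u (unbiased %d)  mantissa=0x%013llx\n",
           sign, exponent, (int)exponent - 1023,
           (unsigned long long)mantissa);
    return 0;
}
```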

How floating-point numbers work

These are incredibly large and small numbers! It is quite uncommon to find an application where single precision is not adequate. You will probably never find a case where double precision limits what you want to accomplish.

These bits form the floating-point number, v, by the relation v = (-1)^S × M × 2^(E-127) for the single precision format, where S is the sign bit, M is the mantissa, and E is the biased exponent. A floating-point system can be used to represent, with a fixed number of digits, numbers of different orders of magnitude. The result of this dynamic range is that the numbers that can be represented are not uniformly spaced; the difference between two consecutive representable numbers grows with the chosen scale.

Over the years, a variety of floating-point representations have been used in computers. However, since the 1990s, the most commonly encountered representation is that defined by the IEEE 754 Standard. The speed of floating-point operations, commonly measured in terms of FLOPS, is an important characteristic of a computer system, especially for applications that involve intensive mathematical calculations.

A floating-point unit (FPU, colloquially a math coprocessor) is a part of a computer system specially designed to carry out operations on floating-point numbers. A number representation specifies some way of encoding a number, usually as a string of digits. There are several mechanisms by which strings of digits can represent numbers. In common mathematical notation, the digit string can be of any length, and the location of the radix point is indicated by placing an explicit "point" character (dot or comma) there.

If the radix point is not specified, then the string implicitly represents an integer and the unstated radix point would be off the right-hand end of the string, next to the least significant digit. In fixed-point systems, a position in the string is specified for the radix point.


So a fixed-point scheme might be to use a string of 8 decimal digits with the decimal point in the middle, whereby "00012345" would represent 0001.2345. In scientific notation, the given number is scaled by a power of 10, so that it lies within a certain range—typically between 1 and 10, with the radix point appearing immediately after the first digit.

The scaling factor, as a power of ten, is then indicated separately at the end of the number. Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of: a signed digit string of a given length in a given base, known as the significand (or mantissa), and a signed integer exponent that modifies the magnitude of the number.

Decimal to Floating Point Conversion

To derive the value of the floating-point number, the significand is multiplied by the base raised to the power of the exponent, equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent—to the right if the exponent is positive or to the left if the exponent is negative. In storing such a number, the base (10) need not be stored, since it will be the same for the entire range of supported numbers, and can thus be inferred.
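As a small illustrative sketch (the significand, base, and exponent below are values chosen for this example only), the following C program evaluates significand × base^exponent directly:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* A decimal floating-point value built from its logical parts:      */
    /* significand 1234567, base 10, exponent -4  ->  123.4567           */
    long significand = 1234567;
    int  base        = 10;
    int  exponent    = -4;

    double value = significand * pow(base, exponent);
    printf("%ld x %d^%d = %.4f\n", significand, base, exponent, value);
    return 0;
}
```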

The occasions on which infinite expansions occur depend on the base and its prime factors. The way in which the significand (including its sign) and exponent are stored in a computer is implementation-dependent. In this binary expansion, let us denote the positions from 0 (leftmost bit, or most significant bit) to 32 (rightmost bit).

It is used to round the 33-bit approximation to the nearest 24-bit number (there are specific rules for halfway values, which is not the case here). When this is stored in memory using the IEEE 754 encoding, this becomes the significand s. The significand is assumed to have a binary point to the right of the leftmost bit. It can be required that the most significant digit of the significand of a non-zero number be non-zero (except when the corresponding exponent would be smaller than the minimum one).

This process is called normalization. For binary formats (which use only the digits 0 and 1), this non-zero leading digit is necessarily 1. Therefore, it does not need to be represented in memory, allowing the format to have one more bit of precision. This rule is variously called the leading bit convention, the implicit bit convention, the hidden bit convention, [3] or the assumed bit convention. The floating-point representation is by far the most common way of representing in computers an approximation to real numbers. However, there are alternatives, such as fixed-point representation, logarithmic number systems, rational arithmetic, and arbitrary-precision arithmetic.

In 1914, Leonardo Torres y Quevedo designed an electro-mechanical version of Charles Babbage's Analytical Engine, and included floating-point arithmetic. The first commercial computer with floating-point hardware was Zuse's Z4 computer, designed in 1942–1945. In some early machines the arithmetic was actually implemented in software, but with a one megahertz clock rate, the speed of floating-point and fixed-point operations was initially faster than that of many competing computers.

The mass-produced IBM 704 followed in 1954; it introduced the use of a biased exponent. For many decades after that, floating-point hardware was typically an optional feature, and computers that had it were said to be "scientific computers", or to have "scientific computation" (SC) capability (see also Extensions for Scientific Computation (XSC)).


It was not until the launch of the Intel i486 in 1989 that general-purpose personal computers had floating-point capability in hardware as a standard feature. Initially, computers used many different representations for floating-point numbers. The lack of standardization at the mainframe level was an ongoing problem by the early 1970s for those writing and maintaining higher-level source code; these manufacturer floating-point standards differed in the word sizes, the representations, and the rounding behavior and general accuracy of operations.

Floating-point compatibility across multiple computing systems was in desperate need of standardization by the early 1980s, leading to the creation of the IEEE 754 standard once the 32-bit (or 64-bit) word had become commonplace. This standard was significantly based on a proposal from Intel, which was designing the i8087 numerical coprocessor; Motorola, which was designing the 68000 around the same time, gave significant input as well.

In 1989, mathematician and computer scientist William Kahan was honored with the Turing Award for being the primary architect behind this proposal; he was aided by his student Jerome Coonen and a visiting professor, Harold Stone. A floating-point number consists of two fixed-point components, whose range depends exclusively on the number of bits or digits in their representation. Whereas the components linearly depend on their range, the floating-point range linearly depends on the significand range and exponentially on the range of the exponent component, which attaches an outstandingly wider range to the number.

On a typical computer system, a double-precision (64-bit) binary floating-point number has a coefficient of 53 bits (including 1 implied bit), an exponent of 11 bits, and 1 sign bit. The number of normalized floating-point numbers in a system (B, P, L, U), where B is the base, P is the precision, and L and U are the smallest and largest allowed exponents, is 2(B − 1)B^(P−1)(U − L + 1). In addition, there are representable values that are not normalized floating-point numbers, namely positive and negative zeros, as well as denormalized numbers. The IEEE standardized the computer representation for binary floating-point numbers in IEEE 754 (also known as IEC 60559) in 1985. This first standard is followed by almost all modern machines. It was revised in 2008. The standard provides for many closely related formats, differing in only a few details. Five of these formats are called basic formats and others are termed extended formats; three of these are especially widely used in computer hardware and languages: single precision (binary32), double precision (binary64), and double extended.

Increasing the precision of the floating point representation generally reduces the amount of accumulated round-off error caused by intermediate calculations. Any integer with absolute value less than 2^24 can be exactly represented in the single precision format, and any integer with absolute value less than 2^53 can be exactly represented in the double precision format.
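A quick way to see these limits is to go one past them; a minimal C sketch (the values are illustrative, and the commented outputs are what an IEEE 754 system typically prints):

```c
#include <stdio.h>

int main(void) {
    float  f = 16777217.0f;          /* 2^24 + 1: not exactly representable in single precision */
    double d = 9007199254740993.0;   /* 2^53 + 1: not exactly representable in double precision */

    printf("%.1f\n", f);             /* typically prints 16777216.0        */
    printf("%.1f\n", d);             /* typically prints 9007199254740992.0 */
    return 0;
}
```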

Furthermore, a wide range of powers of 2 times such a number can be represented. These properties are sometimes used for purely integer data, to get 53-bit integers on platforms that have double precision floats but only 32-bit integers. The standard specifies some special values and their representation: positive and negative infinity, a negative zero, and NaN ("not a number") values. Comparison of floating-point numbers, as defined by the IEEE standard, is a bit different from usual integer comparison. Negative and positive zero compare equal, and every NaN compares unequal to every value, including itself.
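These comparison rules can be observed directly; a minimal C sketch (variable names are my own):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double neg_zero = -0.0, pos_zero = 0.0;
    double not_a_number = NAN;

    printf("-0.0 == 0.0 -> %d\n", neg_zero == pos_zero);          /* 1: signed zeros compare equal            */
    printf("NaN == NaN  -> %d\n", not_a_number == not_a_number);  /* 0: NaN is unequal to everything, itself included */
    printf("NaN != NaN  -> %d\n", not_a_number != not_a_number);  /* 1                                         */
    return 0;
}
```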

Finite floating-point numbers are ordered in the same way as their values in the set of real numbers. Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand (or mantissa), from left to right. For the IEEE binary formats (basic and extended) which have extant hardware implementations, they are apportioned as follows: for example, single precision uses 1 sign bit, 8 exponent bits, and a 24-bit significand (23 bits explicitly stored), while double precision uses 1 sign bit, 11 exponent bits, and a 53-bit significand (52 bits explicitly stored). While the exponent can be positive or negative, in binary formats it is stored as an unsigned number that has a fixed "bias" added to it.

Values of all 0s in this field are reserved for the zeros and subnormal numbers ; values of all 1s are reserved for the infinities and NaNs. Normalized numbers exclude subnormal values, zeros, infinities, and NaNs.

In the IEEE binary interchange formats the leading 1 bit of a normalized significand is not actually stored in the computer datum. It is called the "hidden" or "implicit" bit. Because of this, the single precision format actually has a significand with 24 bits of precision, the double precision format has 53, and quad has 113. For example, the value π rounded to 24 bits of precision has sign 0 and exponent 1; the sum of the exponent bias (127) and the exponent (1) is 128, so π is represented in single precision format as 0 10000000 10010010000111111011011 (excluding the hidden bit), which is 40490FDB as a hexadecimal number. This interpretation is useful for visualizing how the values of floating-point numbers vary with the representation, and allows for certain efficient approximations of floating-point operations by integer operations and bit shifts.

For example, reinterpreting a float as an integer, taking the negative (or rather subtracting from a fixed number, due to the bias and implicit 1), then reinterpreting as a float yields the reciprocal. Explicitly, ignoring the significand, taking the reciprocal is just taking the additive inverse of the unbiased exponent, since the exponent of the reciprocal is the negative of the original exponent. Hence one actually subtracts the exponent from twice the bias, which corresponds to unbiasing, taking the negative, and then biasing.

For the significand, near 1 the reciprocal is approximately linear: 1/x ≈ 2 − x. More significantly, bit shifting allows one to compute the square (shift left by 1) or take the square root (shift right by 1). This leads to approximate computations of the square root; combined with the previous technique for taking the inverse, this allows the fast inverse square root computation, which was important in graphics processing in the late 1980s and 1990s.
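A sketch of this trick in the widely circulated "fast inverse square root" form (the magic constant 0x5f3759df and the single refinement step are the commonly published choices, not something specified in this text; a 32-bit IEEE 754 float is assumed):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) by manipulating the bit pattern of a float. */
static float fast_rsqrt(float x)
{
    float half = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);         /* reinterpret the float's bits as an integer */
    i = 0x5f3759df - (i >> 1);        /* the shift halves the exponent; the constant */
                                      /* re-biases it and tunes the initial guess    */
    float y;
    memcpy(&y, &i, sizeof y);
    return y * (1.5f - half * y * y); /* one Newton-Raphson refinement step */
}

int main(void)
{
    printf("%f\n", fast_rsqrt(4.0f)); /* roughly 0.5 */
    return 0;
}
```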

This can be exploited in some other applications, such as volume ramping in digital sound processing. This holds even for the last step from a given exponent, where the significand overflows into the exponent. Thus, as a graph, it consists of linear pieces (as the significand grows for a given exponent) connecting the evenly spaced powers of two (where the significand is 0), with each linear piece having twice the slope of the previous: each piece takes the same horizontal space, but twice the vertical space of the last.


Because the exponential is convex up, the value is always greater than or equal to the actual shifted and scaled exponential curve through the points with significand 0; by a slightly different shift one can more closely approximate an exponential, sometimes overestimating, sometimes underestimating. Conversely, interpreting a floating-point number as an integer gives an approximate shifted and scaled logarithm, with each piece having half the slope of the last, taking the same vertical space but twice the horizontal space.

Since the logarithm is convex down, the approximation is always less than the corresponding logarithmic curve; again, a different choice of scale and shift yields a closer approximation. In most run-time environments, positive zero is usually printed as "0" and the negative zero as "-0". As with any approximation scheme, operations involving "negative zero" can occasionally cause confusion. Subnormal values fill the underflow gap with values where the absolute distance between them is the same as for adjacent values just outside the underflow gap.

This is an improvement over the older practice of just having zero in the underflow gap, where underflowing results were replaced by zero (flush to zero). Modern floating-point hardware usually handles subnormal values as well as normal values, and does not require software emulation for subnormals.
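A small C sketch showing gradual underflow (assuming IEEE 754 doubles and that flush-to-zero is not enabled on the platform):

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    double smallest_normal = DBL_MIN;        /* smallest positive normalized double    */
    double subnormal = smallest_normal / 4;  /* falls into the underflow gap            */

    printf("%g\n", smallest_normal);         /* about 2.2e-308                          */
    printf("%g\n", subnormal);               /* still nonzero: a subnormal value        */
    printf("%d\n", subnormal > 0.0);         /* 1: gradual underflow, not flush-to-zero */
    return 0;
}
```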

The infinities of the extended real number line can be represented in IEEE floating-point datatypes, just like ordinary floating-point values such as 1 or 1.5. They are not error values in any way, though they are often (but not always, as it depends on the rounding) used as replacement values when there is an overflow.

Upon a divide-by-zero exception, a positive or negative infinity is returned as an exact result. In general, NaNs will be propagated, i.e. most operations involving a NaN will result in a NaN. There are two kinds of NaNs: the default quiet NaNs and, optionally, signaling NaNs. A signaling NaN in any arithmetic operation (including numerical comparisons) will cause an "invalid" exception to be signaled. The representation of NaNs specified by the standard has some unspecified bits that could be used to encode the type or source of error; but there is no standard for that encoding.
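The following minimal C example (illustrative only; it assumes IEEE 754 semantics) produces and propagates these special values:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double pos_inf      = 1.0 / 0.0;   /* divide-by-zero: exact +infinity */
    double neg_inf      = -1.0 / 0.0;  /* exact -infinity                 */
    double not_a_number = 0.0 / 0.0;   /* invalid operation: quiet NaN    */

    printf("%f %f %f\n", pos_inf, neg_inf, not_a_number);
    printf("%f\n", not_a_number + 1.0);                 /* NaN propagates through arithmetic */
    printf("isnan: %d  isinf: %d\n", isnan(not_a_number), isinf(pos_inf));
    return 0;
}
```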

In theory, signaling NaNs could be used by a runtime system to flag uninitialized variables, or extend the floating-point numbers with other special values without slowing down the computations with ordinary values, although such extensions are not common.


It is a common misconception that the more esoteric features of the IEEE standard discussed here, such as extended formats, NaN, infinities, subnormals, etc., are only of interest to numerical analysts or advanced numerical applications. The facts are quite the opposite. In 1977 those features were designed into the Intel 8087 to serve the widest possible market. As Kahan put it, "Error-analysis tells us how to design floating-point arithmetic, like IEEE Standard 754, moderately tolerant of well-meaning ignorance among programmers".

By their nature, all numbers expressed in floating-point format are rational numbers with a terminating expansion in the relevant base (for example, a terminating decimal expansion in base-10, or a terminating binary expansion in base-2). The number of digits (or bits) of precision also limits the set of rational numbers that can be represented exactly. For example, the decimal number 123456789 cannot be exactly represented if only eight decimal digits of precision are available. When a number is represented in some format (such as a character string) which is not a native floating-point representation supported in a computer implementation, then it will require a conversion before it can be used in that implementation.

If the number can be represented exactly in the floating-point format then the conversion is exact. If there is not an exact representation then the conversion requires a choice of which floating-point number to use to represent the original value. The representation chosen will have a different value from the original, and the value thus adjusted is called the rounded value. Whether or not a rational number has a terminating expansion depends on the base.

Any rational with a denominator that has a prime factor other than 2 will have an infinite binary expansion. This means that numbers which appear to be short and exact when written in decimal format may need to be approximated when converted to binary floating-point. For example, the decimal number 0.1 is not representable in binary floating-point of any finite precision; rounded to single precision it is stored as a value whose exact decimal expansion is 0.100000001490116119384765625. The result of rounding differs from the true value by about 1.5 × 10^-9. The difference is the discretization error and is limited by the machine epsilon. The arithmetical difference between two consecutive representable floating-point numbers which have the same exponent is called a unit in the last place (ULP).
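Printing 0.1 with more digits than the format can honour makes this rounding visible; a minimal C sketch (the commented outputs assume IEEE 754 single and double precision):

```c
#include <stdio.h>

int main(void) {
    float  f = 0.1f;
    double d = 0.1;

    /* Printing many digits exposes the rounding performed at conversion. */
    printf("%.30f\n", f);   /* typically 0.100000001490116119384765625000 (single) */
    printf("%.30f\n", d);   /* typically 0.100000000000000005551115123126 (double) */
    return 0;
}
```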

For example, the ULP at 1.0 is the gap between 1.0 and the next larger representable number. For numbers with a base-2 exponent part of 0, i.e. numbers with an absolute value of at least 1 but less than 2, a ULP is exactly 2^-23 (about 10^-7) in single precision, and exactly 2^-52 (about 10^-16) in double precision. Rounding is used when the exact result of a floating-point operation or a conversion to floating-point format would need more digits than there are digits in the significand.
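The ULP can be measured directly with nextafter; a small C sketch (assuming IEEE 754 doubles):

```c
#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void) {
    double x   = 1.0;
    double up  = nextafter(x, 2.0);   /* nearest representable double above 1.0 */
    double ulp = up - x;              /* the ULP at 1.0                          */

    printf("ULP at 1.0     = %g\n", ulp);          /* 2^-52, about 2.22e-16 */
    printf("DBL_EPSILON    = %g\n", DBL_EPSILON);  /* the same value        */
    printf("ULP at 1000000 = %g\n", nextafter(1e6, 2e6) - 1e6);  /* a larger gap */
    return 0;
}
```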

IEEE 754 requires correct rounding: that is, the rounded result is as if infinitely precise arithmetic were used to compute the value and then rounded. There are several different rounding schemes (or rounding modes). Historically, truncation was the typical approach. Since the introduction of IEEE 754, the default method (round to nearest, ties to even, sometimes called banker's rounding) is more commonly used. This method rounds the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and gives that representation as the result.

The IEEE standard requires the same rounding to be applied to all fundamental algebraic operations, including square root and conversions, when there is a numeric (non-NaN) result. This means that the results of IEEE operations are completely determined in all bits of the result, except for the representation of NaNs. Alternative rounding options are also available. IEEE 754 specifies the following rounding modes: round to nearest (ties to even), round to nearest (ties away from zero), round up (toward positive infinity), round down (toward negative infinity), and round toward zero (truncation). Alternative modes are useful when the amount of error being introduced must be bounded.
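The dynamic rounding modes can be exercised through fenv.h; a hedged C sketch (support for the directed modes, and the exact printed values, depend on the platform and compiler settings such as -frounding-math; the volatile qualifiers keep the addition from being folded at compile time):

```c
#include <stdio.h>
#include <fenv.h>

int main(void) {
    volatile double big = 1e30, tiny = 1.0;   /* tiny is lost unless rounding pushes the sum up */

    fesetround(FE_TONEAREST);
    printf("to nearest: %.17g\n", big + tiny);

    fesetround(FE_UPWARD);                    /* directed rounding toward +infinity */
    printf("upward:     %.17g\n", big + tiny);

    fesetround(FE_DOWNWARD);                  /* directed rounding toward -infinity */
    printf("downward:   %.17g\n", big + tiny);

    fesetround(FE_TONEAREST);                 /* restore the default mode */
    return 0;
}
```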

Applications that require a bounded error are multi-precision floating-point and interval arithmetic. The alternative rounding modes are also useful in diagnosing numerical instability: if the results of a subroutine vary substantially between rounding toward positive and negative infinity, then it is likely numerically unstable and affected by round-off error. For ease of presentation and understanding, a decimal radix with 7-digit precision will be used in the examples, as in the IEEE 754 decimal32 format. The fundamental principles are the same in any radix or precision, except that normalization is optional (it does not affect the numerical value of the result).

Here, s denotes the significand and e denotes the exponent. A simple method to add floating-point numbers is to first represent them with the same exponent.
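For instance (an illustrative case with values of my own choosing, using the 7-digit decimal format described above): to add e=3; s=9.876543 (that is, 9876.543) and e=1; s=1.234567 (that is, 12.34567), first rewrite the second operand with the larger exponent as e=3; s=0.01234567. The significands can then be added directly: 9.876543 + 0.01234567 = 9.88888867, so the exact sum is e=3; s=9.88888867, which rounds to the 7-digit result e=3; s=9.888889 (that is, 9888.889).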