I've read about the difference between double precision and single precision. However, in most cases, float and double seem to be interchangeable, i.e. using one or the other does not seem to affect the results. Is this really the case? When are floats and doubles interchangeable? What are the differences between them?
Huge difference.
As the name implies, a double has 2x the precision of float^{[1]}. In general a double has 15 decimal digits of precision, while float has 7.
Here's how the number of digits is calculated:
double has 52 mantissa bits + 1 hidden bit: log(2^{53})÷log(10) = 15.95 digits
float has 23 mantissa bits + 1 hidden bit: log(2^{24})÷log(10) = 7.22 digits
This precision loss could lead to greater truncation errors being accumulated when repeated calculations are done, e.g.
float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++ i)
    b += a;
printf("%.7g\n", b); // prints 9.000023
while
double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++ i)
    b += a;
printf("%.15g\n", b); // prints 8.99999999999996
Also, the maximum value of float is about 3e38, but double is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.
During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.
Of course, sometimes, even double isn't accurate enough, hence we sometimes have long double^{[1]} (the above example gives 9.000000000000000066 on Mac), but all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.
Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use fsum. Otherwise, try to implement the Kahan summation algorithm.
^{[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most architectures (gcc, MSVC; x86, x64, ARM) float is indeed an IEEE single-precision floating point number (binary32), and double is an IEEE double-precision floating point number (binary64).}

The usual advice for summation is to sort your floating point numbers by magnitude (smallest first) before summing. – Aug 6 '10 at 9:49

Note that while C/C++ float and double are nearly always IEEE single and double precision respectively, C/C++ long double is far more variable depending on your CPU, compiler and OS. Sometimes it's the same as double, sometimes it's some system-specific extended format, sometimes it's IEEE quad precision. – plugwash, Feb 8 '19 at 5:27


@InQusitive: Consider for example an array consisting of the value 2^24 followed by 2^24 repetitions of the value 1. Summing in order produces 2^24. Reversing produces 2^25. Of course you can make examples (e.g. make it 2^25 repetitions of 1) where any order ends up being catastrophically wrong with a single accumulator, but smallest-magnitude-first is the best among such. To do better you need some kind of tree. – Jan 2 '20 at 15:18

@R..GitHubSTOPHELPINGICE: summing is even more tricky if the array contains both positive and negative numbers. – chqrlie, Sep 7 '20 at 8:59
Here is what the C99 (ISO/IEC 9899, 6.2.5 §10) and C++2003 (ISO/IEC 14882:2003, 3.9.1 §8) standards say:
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
The C++ standard adds:
The value representation of floating-point types is implementation-defined.
I would suggest having a look at the excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic, which covers the IEEE floating-point standard in depth. You'll learn about the representation details and you'll realize there is a trade-off between magnitude and precision. The precision of the floating point representation increases as the magnitude decreases, hence floating point numbers between -1 and 1 are those with the most precision.
Given a quadratic equation: x^{2} − 4.0000000 x + 3.9999999 = 0, the exact roots to 10 significant digits are, r_{1} = 2.000316228 and r_{2} = 1.999683772.
Using float and double, we can write a test program:
#include <stdio.h>
#include <math.h>

/* Solve a*x^2 + b*x + c = 0 in double precision with the textbook formula. */
void dbl_solve(double a, double b, double c)
{
    double d = b*b - 4.0*a*c;
    double sd = sqrt(d);
    double r1 = (-b + sd) / (2.0*a);
    double r2 = (-b - sd) / (2.0*a);
    printf("%.5f\t%.5f\n", r1, r2);
}

/* Same formula, single precision throughout. */
void flt_solve(float a, float b, float c)
{
    float d = b*b - 4.0f*a*c;
    float sd = sqrtf(d);
    float r1 = (-b + sd) / (2.0f*a);
    float r2 = (-b - sd) / (2.0f*a);
    printf("%.5f\t%.5f\n", r1, r2);
}

int main(void)
{
    float fa = 1.0f;
    float fb = -4.0000000f;
    float fc = 3.9999999f;
    double da = 1.0;
    double db = -4.0000000;
    double dc = 3.9999999;
    flt_solve(fa, fb, fc);
    dbl_solve(da, db, dc);
    return 0;
}
Running the program gives me:
2.00000 2.00000
2.00032 1.99968
Note that the numbers aren't large, but still you get cancellation effects using float.
(In fact, the above is not the best way of solving quadratic equations using either single- or double-precision floating-point numbers, but the answer remains unchanged even if one uses a more stable method.)
 A double is 64 bits and single precision (float) is 32 bits.
 The double has a bigger mantissa (the bits that hold the significant digits of the real number).
 Any inaccuracies will be smaller in the double.
The size of the numbers involved in the floating-point calculations is not the most relevant thing. It's the calculation that is being performed that is relevant.
In essence, if you're performing a calculation and the result is an irrational number or recurring decimal, then there will be rounding errors when that number is squashed into the finite-size data structure you're using. Since double is twice the size of float, the rounding error will be a lot smaller.
The tests may specifically use numbers which would cause this kind of error, and therefore check that you used the appropriate type in your code.
I just ran into an error that took me forever to figure out, and it may give you a good example of float precision.
#include <iostream>
#include <iomanip>

int main(){
    for(float t=0;t<1;t+=0.01){
        std::cout << std::fixed << std::setprecision(6) << t << std::endl;
    }
}
The output is
0.000000
0.010000
0.020000
0.030000
0.040000
0.050000
0.060000
0.070000
0.080000
0.090000
0.100000
0.110000
0.120000
0.130000
0.140000
0.150000
0.160000
0.170000
0.180000
0.190000
0.200000
0.210000
0.220000
0.230000
0.240000
0.250000
0.260000
0.270000
0.280000
0.290000
0.300000
0.310000
0.320000
0.330000
0.340000
0.350000
0.360000
0.370000
0.380000
0.390000
0.400000
0.410000
0.420000
0.430000
0.440000
0.450000
0.460000
0.470000
0.480000
0.490000
0.500000
0.510000
0.520000
0.530000
0.540000
0.550000
0.560000
0.570000
0.580000
0.590000
0.600000
0.610000
0.620000
0.630000
0.640000
0.650000
0.660000
0.670000
0.680000
0.690000
0.700000
0.710000
0.720000
0.730000
0.740000
0.750000
0.760000
0.770000
0.780000
0.790000
0.800000
0.810000
0.820000
0.830000
0.839999
0.849999
0.859999
0.869999
0.879999
0.889999
0.899999
0.909999
0.919999
0.929999
0.939999
0.949999
0.959999
0.969999
0.979999
0.989999
0.999999
As you can see, after 0.83 the accumulated rounding error becomes visible in the sixth decimal place.
However, if I set up t as double, such an issue won't happen.
It took me five hours to realize this minor error, which ruined my program.

Just to be sure: the solution to your issue should preferably be to use an int? If you want to iterate 100 times, you should count with an int rather than using a double. – BlueTrin, Sep 19 '16 at 12:07

Using double is not a good solution here. You use int to count and do an internal multiplication to get your floating-point value. – Richard, Sep 24 '17 at 23:10
Type float, 32 bits long, has a precision of 7 digits. While it may store values with very large or very small range (+/- 3.4 * 10^38 or * 10^-38), it has only 7 significant digits.
Type double, 64 bits long, has a bigger range (*10^+/-308) and 15 digits precision.
Type long double is nominally 80 bits, though a given compiler/OS pairing may store it as 12-16 bytes for alignment purposes. The long double has an exponent that is just ridiculously huge and should have 19 digits precision. Microsoft, in their infinite wisdom, limits long double to 8 bytes, the same as plain double.
Generally speaking, just use type double when you need a floating point value/variable. Literal floating point values used in expressions will be treated as doubles by default, and most of the math functions that return floating point values return doubles. You'll save yourself many headaches and typecastings if you just use double.
Floats have less precision than doubles. You probably know this already, but read What WE Should Know About Floating-Point Arithmetic for a better understanding.
There are three floating point types:
 float
 double
 long double
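A simple Venn diagram would show the sets of values of these types: the values of float are a subset of those of double, which in turn are a subset of those of long double.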
When using floating point numbers you cannot trust that your local tests will behave exactly the same as the tests that are run on the server side. The environment and the compiler are probably different on your local system and where the final tests are run. I have seen this problem many times before in some TopCoder competitions, especially if you try to compare two floating point numbers.
The built-in comparison operations are sensitive to the data type: when you compare two floating point numbers, using float instead of double (or mixing the two) may produce a different outcome.
If one works with embedded processing, the underlying hardware (e.g. an FPGA or some specific processor / microcontroller model) will often have float implemented optimally in hardware, whereas double will use software routines. So if the precision of a float is enough to handle the needs, the program will execute several times faster with float than with double. As noted in other answers, beware of accumulation errors.
Unlike an int (whole number), a float has a decimal point, and so does a double.
But the difference between the two is that a double is twice as detailed as a float, meaning that it can have double the amount of numbers after the decimal point.

It doesn't mean that at all. It actually means twice as many integral decimal digits, and it is more than double. The relationship between fractional digits and precision is not linear: it depends on the value: e.g. 0.5 is precise but 0.33333333333333333333 is not. – Sep 24 '17 at 23:34