IBM PENTIUM STUDY

Summary:
IBM Research focused on the likelihood of error on the Pentium chip in everyday floating point division. Intel has analyzed the probability of making an error based on the assumption that any possible 64-bit pattern is equally likely to occur in both the numerator and the denominator. If that were the case, then the chances of an error would be 1 in 9 billion. Intel also estimates that an average spreadsheet user will perform 1,000 floating point divisions per day. Based on these assumptions, Intel estimates that an error will be encountered only once in 9 million days (or once in about 27,000 years). Our analysis shows that the chances of an error occurring are significantly greater when we simulate the types of calculations performed by financial spreadsheet users, because in that case not all bit patterns are equally probable.

Probability Tests:
We have analyzed a Pentium chip in order to understand the sources of errors and have found that, for an error to occur, both the numerator and the denominator must have certain "bad" bit patterns, as described below. First, for the denominator to be "at risk", that is, capable of producing an error with certain numerators, it must contain a long string of consecutive 1's in its binary representation. Although such numbers represent only a very small fraction of all possible numbers, they occur much more frequently when denominators are created by adding, subtracting, multiplying or dividing simple numbers. For example, the number 3.0, represented exactly, does not have that pattern, but the result of the computation 4.1 - 1.1 does.

How many denominators produced in this fashion are "at risk", that is, capable of producing an error for certain numerators? When we randomly added or subtracted ten random numbers having a single-digit dollar amount and two digits of cents, for example $4.57, one out of every 300 of the results was "at risk" and hence capable of producing an error. When we repeated the test with numbers having two-digit dollar amounts and two digits of cents, one out of every 2,000 could cause an error. If the denominator was calculated by dividing two numbers having one digit to the left and one to the right of the decimal point, then approximately one in every 200 could cause an error. For simplicity, suppose that one of every 1,000 denominators produced by such calculations is "at risk."

Now, suppose we have created a bad denominator. What is the chance of encountering a bad numerator that will produce an error? It depends on the actual value of the "at risk" denominator, but based on our tests, a conservative estimate is that only one out of every 100,000 numerators causes a problem. Finally, when we combine the chances of a bad numerator with the chances of a bad denominator, the result is that one out of every 100 million divisions will give a bad result. Our conclusion is vastly different from Intel's.

Frequency Tests:
We also questioned Intel's analysis and its assumption that spreadsheet users will perform only 1,000 divides in a day. Independent tests suggest that a spreadsheet user (Lotus 1-2-3) performs about 5,000 divides every second while recalculating a spreadsheet. Even if the user does this for only 15 minutes a day, he will perform about 4.2 million divides per day, and according to our probability findings, on average, a computer could make a mistake every 24 days.
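The arithmetic behind these summary figures is simple enough to sketch. The following minimal Python fragment combines the estimated probabilities with the estimated daily divide count quoted above; all of the inputs are the estimates from this study, not measurements of the chip itself.

    # Back-of-the-envelope arithmetic behind the summary figures above.
    # All inputs are the estimates quoted in the text, not measurements.

    p_bad_denominator = 1 / 1_000    # about 1 in 1,000 computed denominators are "at risk"
    p_bad_numerator = 1 / 100_000    # about 1 in 100,000 numerators then trigger an error
    p_error_per_divide = p_bad_denominator * p_bad_numerator
    print(f"Chance of a bad divide: 1 in {1 / p_error_per_divide:,.0f}")      # 1 in 100,000,000

    divides_per_day = 4.2e6          # roughly 15 minutes of spreadsheet recalculation per day
    days_per_error = 1 / (p_error_per_divide * divides_per_day)
    print(f"Expected days between errors for one user: {days_per_error:.0f}") # about 24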
Hypothetically, if 100,000 Pentium customers were doing 15 minutes of calculations every day, we could expect 4,000 mistakes to occur each day.

Conclusion:
The Pentium processor does make errors on floating point divisions when both the numerator and the denominator of the division have certain characteristics. Our study is an analysis based on probabilities; in reality, a user could face either significantly more errors or no errors at all. If an error occurs, it will first appear in the fifth or a higher significant digit, and it may have no effect or it may have catastrophic effects.

Additional Technical Detail: Some Experiments on Pentium Using Decimal Numbers

According to an Intel white paper, if you were to choose a random binary bit pattern for the numerator and the denominator, the probability of an error in divide is about 1 in 9 billion. The error occurs when certain divisors (termed "at risk" or bad) are divided into certain numerators. In order for the error to occur, our belief is that the divisor must lie in certain ranges. For each such denominator, there is a range of numerator values which produce an incorrect result.

An example of affected numbers is the decimal constants we hard-wire in our programs. For example, if we are converting from months to years and are interested in 7-8 decimal digits of accuracy, we can hard-wire a constant to convert from months to years:

alpha = 1/12 = .083333333

Let us construct a hypothetical example. We have contracted a job which is expected to last 22.5 months. The total value of the contract is $96000. From this, tax at a rate of 14 and 2/3 percent has to be deducted. The taxing authority has defined 14 and 2/3 percent to be 14.66667. We want to calculate the net take-home amount on a per annum basis. We do the following calculations:

Tax = 96000 * .1466667 = 14080.0032
Net take-home money = 96000 - 14080.0032 = 81919.9968
Number of years in 22.5 months = 22.5 * .083333333 = 1.8749999925 years
Net take-home money per annum = 81919.9968 / 1.8749999925 = $43690.6667

Most machines give the above answer, which satisfies the desired 7-8 digit accuracy criterion. On Pentium, the answer is $43690.53, which has only 5 correct digits. In this example, both the numerator and the denominator are bad numbers: they are both near a simple integer boundary in their binary representation, and such numbers occur in the real world at a much higher frequency than the totally random bit pattern hypothesis would suggest.

Probabilistic Analysis
We are addressing the question of how likely it is to have a bad divisor. On Pentium, a bad divisor belongs to one of the five bad table entries characterized by 1.0001, 1.0100, 1.0111, 1.1010, and 1.1101, followed by a string of 1's in the mantissa. We have found that if the string of 1's is of length 20 or so, then it is a bad divisor. Given a bad divisor, the probability of making an error in the division increases dramatically compared to the 1 in 9 billion figure quoted by Intel. We did some simple experiments using decimal numbers, and the findings are reported below. We counted only those bad divisors which belong to one of the above five table values, followed by a string of 32 1's. Intel argues that all binary patterns are equally likely. If that were really the case, the probability of finding a bad divisor, as defined above, would be 5/(2**36), or about one in 13 billion random divisors. However, we find the probabilities to be much higher.
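To make this pattern concrete, here is a minimal Python sketch of how such a check might be performed on IEEE-754 double precision values. It extracts the 52 explicit fraction bits of the significand and looks for one of the five bad leading patterns followed by a long run of 1's; the function names, the run-length parameter and the use of 64-bit doubles (rather than the FPU's internal precision) are our own simplifications, not a description of the chip's divide hardware.

    import struct

    # The five "bad" leading significand patterns quoted above (1.0001, 1.0100,
    # 1.0111, 1.1010, 1.1101). The leading 1 is implicit in IEEE-754, so we
    # compare only the first four explicit fraction bits.
    BAD_PREFIXES = {"0001", "0100", "0111", "1010", "1101"}

    def fraction_bits(x: float) -> str:
        """Return the 52 explicit fraction bits of a double as a bit string."""
        (raw,) = struct.unpack(">Q", struct.pack(">d", x))
        return format(raw & ((1 << 52) - 1), "052b")

    def is_at_risk(x: float, run_length: int = 20) -> bool:
        """Heuristic 'bad divisor' test: a bad four-bit prefix followed by a
        long run of 1's (run_length is an assumption; the text uses 20 and 32)."""
        if x == 0.0:
            return False
        bits = fraction_bits(x)
        return bits[:4] in BAD_PREFIXES and bits[4:4 + run_length] == "1" * run_length

    # The example from the summary: 3.0 is exact and harmless, while 4.1 - 1.1
    # comes out just below 3 and carries a long run of 1's after a bad prefix.
    print(fraction_bits(3.0)[:12], is_at_risk(3.0))              # 100000000000 False
    print(fraction_bits(4.1 - 1.1)[:12], is_at_risk(4.1 - 1.1))  # 011111111111 True

    # Under the random-bit-pattern hypothesis, a bad prefix (5 of 16 patterns)
    # followed by 32 ones has probability 5 / 2**36, about 1 in 13 billion.
    print(f"1 in {2**36 / 5:,.0f}")

Note that on a correctly functioning FPU this only flags the bit pattern of a value; it does not, and cannot, reproduce the erroneous quotient itself.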
Addition/Subtraction of Decimal Numbers
In this experiment, we randomly added or subtracted 10 uniformly distributed random numbers in a dollars-and-cents format and then examined the result for the above binary patterns. Here are the results for three cases. In the first case, we chose one digit to the left of the decimal point and two to the right (as in $3.47); in the second case, two digits to the left and two to the right (as in $29.56); in the third case, two digits to the left and one to the right. All the digits were chosen randomly with uniform probability. The results below give the number of times the result of this experiment had the bit pattern corresponding to a bad divisor.

Case 1 (one digit to the left, two to the right) --- 188 out of 100,000
Case 2 (two digits to the left, two to the right) --- 45 out of 100,000
Case 3 (two digits to the left, one to the right) --- 356 out of 100,000

Clearly, these probabilities are much higher than those obtained under the random bit pattern hypothesis.

Division of Two Decimal Numbers:
These experiments were conducted through exhaustive tests on all possible digit patterns. Here (a.b)/(c.d) represents division of a two-digit number (one digit to the left of the decimal point and one to the right) by another two-digit number.

(a.b)/(c.d) --- 44 out of 10,000
(0.ab)/(0.cd) --- 27 out of 10,000
(a.bc)/(d.ef) --- 344 out of 1,000,000
(ab.c)/(de.f) --- 422 out of 1,000,000

Multiplication of Two Numbers
Here we are multiplying a decimal number by another number which was computed as the reciprocal of a decimal number, as in scaling by a constant.

(a.b) * (1/(c.d)) --- 37 out of 10,000
(a.bc) * (1/(d.e)) --- 139 out of 100,000
(a.bc) * (1/(d.ef)) --- 434 out of 1,000,000

To summarize, for decimal calculations of the type given above, the probability of obtaining a result which falls into the category of a bad divisor is rather high. It appears to be somewhere between 1 in 3,000 and 1 in 250; let us say that it is of the order of 1 in 1,000. Furthermore, if the rounding mode corresponds to truncate, the probability of arriving at bad divisors increases significantly.
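As an illustration of how the addition/subtraction experiment above might be set up, here is a minimal Monte Carlo sketch in Python for the Case 1 shape. It reuses the bit-pattern check sketched earlier; the digit ranges, the random choice of sign and the run length of 1's are our assumptions about the setup, so the count it prints should only be read as being of the same general order as the figures reported above.

    import random
    import struct

    BAD_PREFIXES = {"0001", "0100", "0111", "1010", "1101"}  # the five bad table entries

    def is_at_risk(x: float, run_length: int = 20) -> bool:
        # Same heuristic as before: a bad four-bit prefix of the fraction
        # followed by a long run of 1's in the double's significand.
        if x == 0.0:
            return False
        (raw,) = struct.unpack(">Q", struct.pack(">d", x))
        bits = format(raw & ((1 << 52) - 1), "052b")
        return bits[:4] in BAD_PREFIXES and bits[4:4 + run_length] == "1" * run_length

    def random_amount() -> float:
        # A Case 1 style amount: one random digit of dollars and two digits of
        # cents, e.g. 3.47 (the exact digit ranges are our assumption).
        return random.randint(1, 9) + random.randint(0, 99) / 100.0

    def trial() -> float:
        # Randomly add or subtract ten such amounts, as in the experiment above.
        total = 0.0
        for _ in range(10):
            total += random.choice((1.0, -1.0)) * random_amount()
        return total

    trials = 100_000
    hits = sum(is_at_risk(trial()) for _ in range(trials))
    print(f"{hits} at-risk results out of {trials:,} trials")

Changing the digit ranges in random_amount gives the Case 2 and Case 3 shapes.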
The Dependency on the Numerator
Given a bad divisor, the divide error occurs for certain ranges of values of the numerator. If we were to take a totally random bit pattern for the numerator, the probability of error appears to be of the order of one in 100,000. This is a first-cut rough estimate and could probably be improved. The probabilities appear to be different for different table values; the table entry corresponding to '1.0001' seems to produce the most errors. For the numerator, too, there are bands of values where the error is much more likely, and these bands are more prominent near whole numbers. For example, if we were using (19.4 - 10.4) = '9' as a divisor (a bad one) and picked a random value between 6 and 6.01 as the numerator, then the chance of error increases to about one in 1,000. For the purpose of our simple analysis, we will use the figure of 1 in 100,000 for a bad numerator; this assumes that we are picking a random numerator. Using the value of 1 in 1,000 as the probability of a bad divisor, the overall probability of a 'typical' divide being incorrect seems to be of the order of 1 in 100 million. This is about two orders of magnitude higher than the Intel estimate of 1 in 9 billion.

Probability of a Divide Instruction
Let us assume that a Pentium operating at 90 MHz does an op in 1.2 cycles on average. That will give about 75 million ops per second of actual compute time. We will use a figure of 1 divide per 16,000 instructions, even though many estimates suggest a much higher frequency of divides. Using this conservative estimate of one divide per 16,000 instructions, we arrive at about 4,687 divides per second. Let us further assume that a typical spreadsheet user does only about 15 minutes of actual intensive computing per day. Then he is likely to do 4,687 * 900 = 4.2 million divides per day. Assuming an error rate of 1 in 100 million, it will take about 24 days for an error to occur for an individual user. Combining this with the fact that there are millions of Pentium users worldwide, we quickly come to the conclusion that on a typical day a large number of people are making mistakes in their computations without realizing it.

IBM Corporation
ibmstudy@watson.ibm.com