Toby Opferman
http://www.opferman.net
programming@opferman.net

                         IEEE Floating Point

In this simple tutorial we will learn the IEEE floating point formats for
single, double and extended precision, and how to convert to and from
these formats.

Before you read this, I assume you can convert whole binary numbers to
decimal. This tutorial teaches you how to convert the part of a real number
after the decimal point. The whole-number part is still converted the same
way as any other binary number, so you should read the number base tutorial
first if you do not already know how.

  Single Precision is 32 bits (4 Bytes)
  Double Precision is 64 bits (8 Bytes)
  Extended Precision is 80 bits (10 Bytes)

  Single:   [ 1 Sign Bit |  8 Bit Exponent | 23 Bit Mantissa ]
  Double:   [ 1 Sign Bit | 11 Bit Exponent | 52 Bit Mantissa ]
  Extended: [ 1 Sign Bit | 15 Bit Exponent | 64 Bit Mantissa ]

Sign Bit: 1 = Negative, 0 = Positive

Double precision stores 52 mantissa bits. Counting the implied leading 1,
it has 53 bits of precision.

The following shows 5 different numbers in the 3 different IEEE formats:

  Single Precision
     1.0      3F 80 00 00
     2.0      40 00 00 00
     0.0      00 00 00 00
     1.08     3F 8A 3D 71
    10.333    41 25 53 F8

  Double Precision
     1.0      3F F0 00 00 00 00 00 00
     2.0      40 00 00 00 00 00 00 00
     0.0      00 00 00 00 00 00 00 00
     1.08     3F F1 47 AE 14 7A E1 48
    10.333    40 24 AA 7E F9 DB 22 D1

  Extended Precision
     1.0      3F FF 80 00 00 00 00 00 00 00
     2.0      40 00 80 00 00 00 00 00 00 00
     0.0      00 00 00 00 00 00 00 00 00 00
     1.08     3F FF 8A 3D 70 A3 D7 0A 3D 71
    10.333    40 02 A5 53 F7 CE D9 16 87 2B

Single Precision

The exponent is stored in excess 127 and the mantissa is 1.xxxx

  3F 80 00 00

  Sign Bit   Exp        1.Mantissa
  0          01111111   00000000000000000000000

  127 - 127 = 0
  1.0 bitshift 0 places

Exponent = 0, so the number is 1.0

Double Precision

The exponent is stored in excess 1023 and the mantissa is 1.xxxx

  3F F0 00 00 00 00 00 00

  Sign Bit   Exp           1.Mantissa
  0          01111111111   0000000000000000000000000000000000000000000000000000

  Exponent stored Excess 1023
  1023 - 1023 = 0
  1.0 bitshift 0 places

1.0 is the answer.
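The two decodings above follow the same recipe: split the bits into sign, exponent and mantissa, subtract the excess, and put the implied 1 in front of the mantissa. Here is a minimal Python sketch of that recipe. The function name decode_ieee and its parameters are made up for this illustration, and it only handles zero and ordinary (normal) numbers, not denormals, infinities or NaNs.

```python
import struct

def decode_ieee(hex_bytes, exp_bits, man_bits):
    """Decode a single/double value written as big-endian hex text.

    Hypothetical helper for illustration: zero and normal numbers only.
    """
    raw = int(hex_bytes.replace(" ", ""), 16)
    mantissa = raw & ((1 << man_bits) - 1)                # low bits
    exponent = (raw >> man_bits) & ((1 << exp_bits) - 1)  # middle bits
    sign = raw >> (man_bits + exp_bits)                   # top bit
    if exponent == 0 and mantissa == 0:
        return -0.0 if sign else 0.0                      # zero has no implied 1
    excess = (1 << (exp_bits - 1)) - 1                    # 127 or 1023
    # 1.Mantissa, shifted by (stored exponent - excess) places
    value = (1 + mantissa / (1 << man_bits)) * 2.0 ** (exponent - excess)
    return -value if sign else value

print(decode_ieee("3F 80 00 00", 8, 23))               # 1.0
print(decode_ieee("40 00 00 00", 8, 23))               # 2.0
print(decode_ieee("3F F0 00 00 00 00 00 00", 11, 52))  # 1.0

# Cross-check against Python's own unpacking of the same four bytes.
builtin = struct.unpack(">f", bytes.fromhex("3F8A3D71"))[0]
print(builtin == decode_ieee("3F 8A 3D 71", 8, 23))    # True
```

Passing (8, 23) selects single precision and (11, 52) selects double precision. The extended format is left out here because its leading 1 is stored rather than implied.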

Extended Precision

  3F FF 80 00 00 00 00 00 00 00

  Sign Bit   Exp               Mantissa
  0          011111111111111   1000000000000000000000000000000000000000000000000000000000000000

  Exponent stored Excess 16383
  16383 - 16383 = 0
  1.0 bitshift 0 places

1.0 is the answer.

Single Precision:

  40 00 00 00

  Sign Bit   Exp        1.Mantissa
  0          10000000   00000000000000000000000

  128 - 127 = 1
  1.0 bitshifted 1 place becomes 10.0 in binary

The answer is 2.0

You can see the double and extended forms of 2.0 work the same way, and the
next number is obviously 0: every bit is zero, which is a special case with
no implied 1.

Now it's time to take the mantissa out and find out what it is.

  3F 8A 3D 71

  Sign Bit   Exp        1.Mantissa
  0          01111111   00010100011110101110001

We know the exponent is 0, since we just decoded the same exponent bits for
1.0. Getting the fraction is almost the same as converting regular binary to
decimal, with one small difference: instead of each bit representing a
positive power of 2, the bits represent negative powers of 2, starting from
the left.

  Bit:     0   0   0   1   0   1   0   0   0   1   1   1   1   0   1   0   1   1   1   0   0   0   1
  Power:  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10 -11 -12 -13 -14 -15 -16 -17 -18 -19 -20 -21 -22 -23

So, you add up the powers of 2 whose bit isn't 0. (You multiply each power
by its bit. If the bit is 0 you get 0, so you only add up the ones with a
set bit.)

  Set bits are at powers: -4 -6 -10 -11 -12 -13 -15 -17 -18 -19 -23

  2^-4 + 2^-6 + 2^-10 + 2^-11 + 2^-12 + 2^-13 + 2^-15 + 2^-17 + 2^-18 +
  2^-19 + 2^-23 = .080000042915

  1.080000042915 * 2^0 = 1.080000042915

You are going to have trailing digits, because 1.08 cannot be written
exactly in binary.

To convert TO IEEE you do the following: divide the fractional part by 2^-1
(which is the same as multiplying it by 2). The whole-number part of each
result, 0 or 1, is the next bit. Then you take off the whole number and
divide the remaining fraction again.
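
Dividing by 2^-1 is the same as doubling, so the procedure just described can be sketched in a few lines of Python. This is only an illustration: the function name fraction_bits is invented here, and it uses Python's Fraction type so the decimal input is handled exactly, with no rounding error of its own. It produces raw bits only and leaves any final rounding to you.

```python
from fractions import Fraction

def fraction_bits(decimal_text, count):
    """Return the first `count` binary digits after the point.

    Hypothetical helper: `decimal_text` is a fraction below 1 written
    as text, for example "0.08".
    """
    frac = Fraction(decimal_text)   # exact value of the decimal text
    bits = []
    for _ in range(count):
        frac *= 2                   # same as dividing by 2^-1
        whole = int(frac)           # whole-number part: 0 or 1
        bits.append(str(whole))
        frac -= whole               # take off the whole number
    return "".join(bits)

print(fraction_bits("0.08", 24))    # 000101000111101011100001
```

The 24th bit is included on purpose: it is the bit you look at when rounding the result down to 23 mantissa bits.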

  Operation           Bit position   Bit
  .08/2^-1 = 0.16          1          0
  .16/2^-1 = 0.32          2          0
  .32/2^-1 = 0.64          3          0
  .64/2^-1 = 1.28          4          1
  .28/2^-1 = 0.56          5          0
  .56/2^-1 = 1.12          6          1
  .12/2^-1 = 0.24          7          0
  .24/2^-1 = 0.48          8          0
  .48/2^-1 = 0.96          9          0
  .96/2^-1 = 1.92         10          1
  .92/2^-1 = 1.84         11          1
  .84/2^-1 = 1.68         12          1
  .68/2^-1 = 1.36         13          1
  .36/2^-1 = 0.72         14          0
  .72/2^-1 = 1.44         15          1
  .44/2^-1 = 0.88         16          0
  .88/2^-1 = 1.76         17          1
  .76/2^-1 = 1.52         18          1
  .52/2^-1 = 1.04         19          1
  .04/2^-1 = 0.08         20          0
  .08/2^-1 = 0.16         21          0
  .16/2^-1 = 0.32         22          0
  .32/2^-1 = 0.64         23          0
  .64/2^-1 = 1.28         24          1

Notice that the whole-number parts spell out the mantissa bits, with one
exception: we have a 0 in bit 23 where the stored value above has a 1. That
is because the value was taken out to 24 places, like we did above, and
rounded. Since bit 24 is a 1, bit 23 rounds up to 1. Therefore, we have
gotten the same mantissa.

Now we convert the whole-number part (1) to binary as usual and put it in
front:

  1.00010100011110101110001

Next, the number needs to be in 1.xxxx times a power of 2 form, but it is
already there. So we knock off the leading 1 and keep

  00010100011110101110001

as the mantissa, and we store 127 as the exponent, since 127 - 127 = 0
shifts. The sign bit is 0 as well. Put together:

  0 01111111 00010100011110101110001 = 3F 8A 3D 71

10.333

We will decode both the double precision and the extended precision forms
of this number.
----------------------------------------------
Double Precision

  40 24 AA 7E F9 DB 22 D1

  01000000 00100100 10101010 01111110 11111001 11011011 00100010 11010001

  Sign Bit   Exp           1.Mantissa
  0          10000000010   0100101010100111111011111001110110110010001011010001

  10000000010 = 1026
  1026 - 1023 = 3

Remember, all exponents are stored in EXCESS form, so you subtract the
excess from the stored exponent to get the shift. Remember also that a
negative shift means move the binary point to the left and a positive shift
means move it to the right. Only after the shift do you start counting
mantissa positions.

Insert the implied 1.

  1.0100101010100111111011111001110110110010001011010001

Shift 3 places

  1010.0101010100111111011111001110110110010001011010001

The whole number is 10. (1010b = Ah = 10)

The mantissa (the bits after the binary point):

  0101010100111111011111001110110110010001011010001

Find the bit positions with a 1:

  2, 4, 6, 8, 11, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 25, 26, 27, 29,
  30, 32, 33, 36, 40, 42, 43, 45, 49

  2^-2 + 2^-4 + 2^-6 + 2^-8 + 2^-11 + 2^-12 + 2^-13 + 2^-14 + 2^-15 +
  2^-16 + 2^-18 + 2^-19 + 2^-20 + 2^-21 + 2^-22 + 2^-25 + 2^-26 + 2^-27 +
  2^-29 + 2^-30 + 2^-32 + 2^-33 + 2^-36 + 2^-40 + 2^-42 + 2^-43 + 2^-45 +
  2^-49 = .333

Answer is 10.333
-------------------------------------------------------
Extended Precision

  40 02 A5 53 F7 CE D9 16 87 2B

  0100 0000 0000 0010 1010 0101 0101 0011 1111 0111
  1100 1110 1101 1001 0001 0110 1000 0111 0010 1011

  Sign Bit   Exp               Mantissa
  0          100000000000010   1010010101010011111101111100111011011001000101101000011100101011

  100000000000010 = 16386
  16386 - 16383 = 3

So, you have

  1.010010101010011111101111100111011011001000101101000011100101011

Move the binary point 3 places

  1010.010101010011111101111100111011011001000101101000011100101011

Notice from this example and the earlier extended precision example that
the first bit in the mantissa is actually the whole number of 1.xxxxx. So
the mantissa has 63 fraction bits plus 1 bit for the whole number, which
makes 64 bits. In the other forms, single and double, the 1 isn't written
into the mantissa; it's implied to be there.

Now, if we look at the part above the binary point, we see it's 10.

  10.xxxx

Now we need to multiply out the powers 2^-n and add.

mbitn = mantissa bit #n, counting from left to right after the point.
N = the number of bits after the point.

You can say the fraction is

  Summation for n = 1 to N of (mbitn * 2^-n)

Mantissa:

  010101010011111101111100111011011001000101101000011100101011

The 1 is in bit positions: 2, 4, 6, 8, 11, 12, 13, 14, 15, 16, 18, 19, 20,
21, 22, 25, 26, 27, 29, 30, 32, 33, 36, 40, 42, 43, 45, 50, 51, 52, 55,
57, 59, 60

So,

  2^-2 + 2^-4 + 2^-6 + 2^-8 + 2^-11 + 2^-12 + 2^-13 + 2^-14 + 2^-15 +
  2^-16 + 2^-18 + 2^-19 + 2^-20 + 2^-21 + 2^-22 + 2^-25 + 2^-26 + 2^-27 +
  2^-29 + 2^-30 + 2^-32 + 2^-33 + 2^-36 + 2^-40 + 2^-42 + 2^-43 + 2^-45 +
  2^-50 + 2^-51 + 2^-52 + 2^-55 + 2^-57 + 2^-59 + 2^-60 = .333

Answer is 10.333

Now you see how the IEEE floating point format works in single precision,
double precision and extended precision. The only difference between
single/double and extended, besides the size of the exponent and mantissa,
is that single and double precision have a leading 1 bit (1.Mantissa) that
is not stored in the format itself, whereas in the extended format the 1
bit is actually IN the mantissa as the first bit, and only the binary point
is implied to be there.

Notice again that the double precision value rounded up into bit 49. The
extended value has a 0 in bit 49 followed by 1s in bits 50, 51 and 52,
which double precision has no room to store.

Single precision and double precision math done on the FPU should be
decently accurate, since the FPU of the PC works internally with 80-bit
values: results are calculated with extra bits beyond what the format
stores and are then rounded to fit. Extended precision math does NOT have
those extra bits like the other two. It goes out to bit 80 and there is
nothing beyond that to round from. So the last decimal places of an
extended floating point number aren't always extremely accurate, and the
result may only be as accurate as double precision. Then again, you do have
more places, and it may help to have even an approximation of the trailing
digits. Just remember that the FPU carries single and double precision out
to 80 bits, so they get good rounding approximations.

That is the end of the tutorial. You have seen the formats, we have decoded
them, and on one occasion we even converted a number into the format. So,
you should understand how to convert numbers to and from the IEEE
single/double/extended floating point formats.
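
As a way to check the extended precision decoding by machine, here is a minimal Python sketch. The function name decode_extended is made up for this example. It assumes the 10 bytes are written most significant byte first, as in this tutorial, and it uses exact fractions because a 64-bit mantissa does not fit in a Python float. It handles zero and ordinary numbers only.

```python
from fractions import Fraction

def decode_extended(hex_bytes):
    """Decode an 80-bit extended value written as big-endian hex text.

    Hypothetical helper for illustration: zero and normal numbers only.
    """
    raw = int(hex_bytes.replace(" ", ""), 16)
    mantissa = raw & ((1 << 64) - 1)   # 64 bits, leading 1 is stored
    exponent = (raw >> 64) & 0x7FFF    # 15 bits, excess 16383
    sign = raw >> 79
    # The binary point sits after the first mantissa bit, so divide by 2^63.
    value = Fraction(mantissa, 1 << 63) * Fraction(2) ** (exponent - 16383)
    return -value if sign else value

print(decode_extended("3F FF 80 00 00 00 00 00 00 00"))         # 1
print(float(decode_extended("40 02 A5 53 F7 CE D9 16 87 2B")))  # 10.333
```

Converting the exact result with float() rounds it to double precision, which is why the second line prints 10.333 rather than a longer string of digits.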