E-book, English, 579 pages
Muller / Brisebarre / De Dinechin: Handbook of Floating-Point Arithmetic
1st edition, 2009
ISBN: 978-0-8176-4705-6
Publisher: Birkhäuser Boston
Format: PDF
Copy protection: PDF watermark
Floating-point arithmetic is the most widely used way of implementing real-number arithmetic on modern computers. However, making such an arithmetic reliable and portable, yet fast, is a very difficult task. As a result, floating-point arithmetic is far from being exploited to its full potential. This handbook aims to provide a complete overview of modern floating-point arithmetic. So that the techniques presented can be put directly into practice in actual coding or design, they are illustrated, whenever possible, by a corresponding program. The handbook is designed for programmers of numerical applications, compiler designers, programmers of floating-point algorithms, designers of arithmetic operators, and more generally, students and researchers in numerical analysis who wish to better understand a tool used in their daily work and research.
Further Information & Material
Preface  14
List of Figures  16
List of Tables  19

I Introduction, Basic Definitions, and Standards  22
  1 Introduction  23
    1.1 Some History  23
    1.2 Desirable Properties  26
    1.3 Some Strange Behaviors  27
      1.3.1 Some famous bugs  27
      1.3.2 Difficult problems  28
  2 Definitions and Basic Notions  33
    2.1 Floating-Point Numbers  33
    2.2 Rounding  40
      2.2.1 Rounding modes  40
      2.2.2 Useful properties  42
      2.2.3 Relative error due to rounding  43
    2.3 Exceptions  45
    2.4 Lost or Preserved Properties of the Arithmetic on the Real Numbers  47
    2.5 Note on the Choice of the Radix  49
      2.5.1 Representation errors  49
      2.5.2 A case for radix 10  50
    2.6 Tools for Manipulating Floating-Point Errors  52
      2.6.1 The ulp function  52
      2.6.2 Errors in ulps and relative errors  57
      2.6.3 An example: iterated products  57
      2.6.4 Unit roundoff  59
    2.7 Note on Radix Conversion  60
      2.7.1 Conditions on the formats  60
      2.7.2 Conversion algorithms  63
    2.8 The Fused Multiply-Add (FMA) Instruction  71
    2.9 Interval Arithmetic  71
      2.9.1 Intervals with floating-point bounds  72
      2.9.2 Optimized rounding  72
  3 Floating-Point Formats and Environment  74
    3.1 The IEEE 754-1985 Standard  75
      3.1.1 Formats specified by IEEE 754-1985  75
      3.1.2 Little-endian, big-endian  79
      3.1.3 Rounding modes specified by IEEE 754-1985  80
      3.1.4 Operations specified by IEEE 754-1985  81
      3.1.5 Exceptions specified by IEEE 754-1985  85
      3.1.6 Special values  88
    3.2 The IEEE 854-1987 Standard  89
      3.2.1 Constraints internal to a format  89
      3.2.2 Various formats and the constraints between them  90
      3.2.3 Conversions between floating-point numbers and decimal strings  91
      3.2.4 Rounding  92
      3.2.5 Operations  92
      3.2.6 Comparisons  93
      3.2.7 Exceptions  93
    3.3 The Need for a Revision  93
      3.3.1 A typical problem: "double rounding"  94
      3.3.2 Various ambiguities  96
    3.4 The New IEEE 754-2008 Standard  98
      3.4.1 Formats specified by the revised standard  99
      3.4.2 Binary interchange format encodings  100
      3.4.3 Decimal interchange format encodings  101
      3.4.4 Larger formats  111
      3.4.5 Extended and extendable precisions  111
      3.4.6 Attributes  112
      3.4.7 Operations specified by the standard  116
      3.4.8 Comparisons  118
      3.4.9 Conversions  118
      3.4.10 Default exception handling  119
      3.4.11 Recommended transcendental functions  122
    3.5 Floating-Point Hardware in Current Processors  123
      3.5.1 The common hardware denominator  123
      3.5.2 Fused multiply-add  123
      3.5.3 Extended precision  123
      3.5.4 Rounding and precision control  124
      3.5.5 SIMD instructions  125
      3.5.6 Floating-point on x86 processors: SSE2 versus x87  125
      3.5.7 Decimal arithmetic  126
    3.6 Floating-Point Hardware in Recent Graphics Processing Units  127
    3.7 Relations with Programming Languages  128
      3.7.1 The Language Independent Arithmetic (LIA) standard  128
      3.7.2 Programming languages  129
    3.8 Checking the Environment  129
      3.8.1 MACHAR  130
      3.8.2 Paranoia  130
      3.8.3 UCBTest  134
      3.8.4 TestFloat  135
      3.8.5 IeeeCC754  135
      3.8.6 Miscellaneous  135

II Cleverly Using Floating-Point Arithmetic  136
  4 Basic Properties and Algorithms  137
    4.1 Testing the Computational Environment  137
      4.1.1 Computing the radix  137
      4.1.2 Computing the precision  139
    4.2 Exact Operations  140
      4.2.1 Exact addition  140
      4.2.2 Exact multiplications and divisions  142
    4.3 Accurate Computations of Sums of Two Numbers  143
      4.3.1 The Fast2Sum algorithm  144
      4.3.2 The 2Sum algorithm  147
      4.3.3 If we do not use rounding to nearest  149
    4.4 Computation of Products  150
      4.4.1 Veltkamp splitting  150
      4.4.2 Dekker's multiplication algorithm  153
    4.5 Complex Numbers  157
      4.5.1 Various error bounds  158
      4.5.2 Error bound for complex multiplication  159
      4.5.3 Complex division  162
      4.5.4 Complex square root  167
  5 The Fused Multiply-Add Instruction  169
    5.1 The 2MultFMA Algorithm  170
    5.2 Computation of Residuals of Division and Square Root  171
    5.3 Newton–Raphson-Based Division with an FMA  173
      5.3.1 Variants of the Newton–Raphson iteration  173
      5.3.2 Using the Newton–Raphson iteration for correctly rounded division  178
    5.4 Newton–Raphson-Based Square Root with an FMA  185
      5.4.1 The basic iterations  185
      5.4.2 Using the Newton–Raphson iteration for correctly rounded square roots  186
    5.5 Multiplication by an Arbitrary-Precision Constant  189
      5.5.1 Checking for a given constant C if Algorithm 5.2 will always work  190
    5.6 Evaluation of the Error of an FMA  193
    5.7 Evaluation of Integer Powers  195
  6 Enhanced Floating-Point Sums, Dot Products, and Polynomial Values  198
    6.1 Preliminaries  199
      6.1.1 Floating-point arithmetic models  200
      6.1.2 Notation for error analysis and classical error estimates  201
      6.1.3 Properties for deriving validated running error bounds  204
    6.2 Computing Validated Running Error Bounds  205
    6.3 Computing Sums More Accurately  207
      6.3.1 Reordering the operands, and a bit more  207
      6.3.2 Compensated sums  209
      6.3.3 Implementing a "long accumulator"  216
      6.3.4 On the sum of three floating-point numbers  216
    6.4 Compensated Dot Products  218
    6.5 Compensated Polynomial Evaluation  220
  7 Languages and Compilers  222
    7.1 A Play with Many Actors  222
      7.1.1 Floating-point evaluation in programming languages  223
      7.1.2 Processors, compilers, and operating systems  225
      7.1.3 In the hands of the programmer  226
    7.2 Floating Point in the C Language  226
      7.2.1 Standard C99 headers and IEEE 754-1985 support  226
      7.2.2 Types  227
      7.2.3 Expression evaluation  230
      7.2.4 Code transformations  233
      7.2.5 Enabling unsafe optimizations  234
      7.2.6 Summary: a few horror stories  235
    7.3 Floating-Point Arithmetic in the C++ Language  237
      7.3.1 Semantics  237
      7.3.2 Numeric limits  238
      7.3.3 Overloaded functions  239
    7.4 FORTRAN Floating Point in a Nutshell  240
      7.4.1 Philosophy  240
      7.4.2 IEEE 754 support in FORTRAN  243
    7.5 Java Floating Point in a Nutshell  244
      7.5.1 Philosophy  244
      7.5.2 Types and classes  245
      7.5.3 Infinities, NaNs, and signed zeros  247
      7.5.4 Missing features  248
      7.5.5 Reproducibility  249
      7.5.6 The BigDecimal package  250
    7.6 Conclusion  251

III Implementing Floating-Point Operators  253
  8 Algorithms for the Five Basic Operations  254
    8.1 Overview of Basic Operation Implementation  254
    8.2 Implementing IEEE 754-2008 Rounding  256
      8.2.1 Rounding a nonzero finite value with unbounded exponent range  256
      8.2.2 Overflow  258
      8.2.3 Underflow and subnormal results  259
      8.2.4 The inexact exception  260
      8.2.5 Rounding for actual operations  260
    8.3 Floating-Point Addition and Subtraction  261
      8.3.1 Decimal addition  264
      8.3.2 Decimal addition using binary encoding  265
      8.3.3 Subnormal inputs and outputs in binary addition  266
    8.4 Floating-Point Multiplication  266
      8.4.1 Normal case  267
      8.4.2 Handling subnormal numbers in binary multiplication  267
      8.4.3 Decimal specifics  268
    8.5 Floating-Point Fused Multiply-Add  269
      8.5.1 Case analysis for normal inputs  269
      8.5.2 Handling subnormal inputs  273
      8.5.3 Handling decimal cohorts  274
      8.5.4 Overview of a binary FMA implementation  274
    8.6 Floating-Point Division  277
      8.6.1 Overview and special cases  277
      8.6.2 Computing the significand quotient  278
      8.6.3 Managing subnormal numbers  279
      8.6.4 The inexact exception  280
      8.6.5 Decimal specifics  280
    8.7 Floating-Point Square Root  280
      8.7.1 Overview and special cases  280
      8.7.2 Computing the significand square root  281
      8.7.3 Managing subnormal numbers  282
      8.7.4 The inexact exception  282
      8.7.5 Decimal specifics  282
  9 Hardware Implementation of Floating-Point Arithmetic  283
    9.1 Introduction and Context  283
      9.1.1 Processor internal formats  283
      9.1.2 Hardware handling of subnormal numbers  284
      9.1.3 Full-custom VLSI versus reconfigurable circuits  285
      9.1.4 Hardware decimal arithmetic  286
      9.1.5 Pipelining  287
    9.2 The Primitives and Their Cost  288
      9.2.1 Integer adders  288
      9.2.2 Digit-by-integer multiplication in hardware  294
      9.2.3 Using nonstandard representations of numbers  294
      9.2.4 Binary integer multiplication  295
      9.2.5 Decimal integer multiplication  297
      9.2.6 Shifters  298
      9.2.7 Leading-zero counters  298
      9.2.8 Tables and table-based methods for fixed-point function approximation  300
    9.3 Binary Floating-Point Addition  302
      9.3.1 Overview  302
      9.3.2 A first dual-path architecture  303
      9.3.3 Leading-zero anticipation  305
      9.3.4 Probing further on floating-point adders  309
    9.4 Binary Floating-Point Multiplication  310
      9.4.1 Basic architecture  310
      9.4.2 FPGA implementation  310
      9.4.3 VLSI implementation optimized for delay  312
      9.4.4 Managing subnormals  315
    9.5 Binary Fused Multiply-Add  316
      9.5.1 Classic architecture  317
      9.5.2 To probe further  319
    9.6 Division  319
      9.6.1 Digit-recurrence division  320
      9.6.2 Decimal division  323
    9.7 Conclusion: Beyond the FPU  323
      9.7.1 Optimization in context of standard operators  324
      9.7.2 Operation with a constant operand  325
      9.7.3 Block floating point  327
      9.7.4 Specific architectures for accumulation  327
      9.7.5 Coarser-grain operators  331
    9.8 Probing Further  334
  10 Software Implementation of Floating-Point Arithmetic  335
    10.1 Implementation Context  336
      10.1.1 Standard encoding of binary floating-point data  336
      10.1.2 Available integer operators  337
      10.1.3 First examples  340
      10.1.4 Design choices and optimizations  342
    10.2 Binary Floating-Point Addition  343
      10.2.1 Handling special values  344
      10.2.2 Computing the sign of the result  346
      10.2.3 Swapping the operands and computing the alignment shift  347
      10.2.4 Getting the correctly rounded result  349
    10.3 Binary Floating-Point Multiplication  355
      10.3.1 Handling special values  355
      10.3.2 Sign and exponent computation  357
      10.3.3 Overflow detection  359
      10.3.4 Getting the correctly rounded result  360
    10.4 Binary Floating-Point Division  363
      10.4.1 Handling special values  364
      10.4.2 Sign and exponent computation  365
      10.4.3 Overflow detection  368
      10.4.4 Getting the correctly rounded result  369
    10.5 Binary Floating-Point Square Root  375
      10.5.1 Handling special values  376
      10.5.2 Exponent computation  378
      10.5.3 Getting the correctly rounded result  379

IV Elementary Functions  387
  11 Evaluating Floating-Point Elementary Functions  388
    11.1 Basic Range Reduction Algorithms  392
      11.1.1 Cody and Waite's reduction algorithm  392
      11.1.2 Payne and Hanek's algorithm  394
    11.2 Bounding the Relative Error of Range Reduction  395
    11.3 More Sophisticated Range Reduction Algorithms  397
      11.3.1 An example of range reduction for the exponential function  399
      11.3.2 An example of range reduction for the logarithm  400
    11.4 Polynomial or Rational Approximations  401
      11.4.1 L2 case  402
      11.4.2 L∞, or minimax case  403
      11.4.3 "Truncated" approximations  405
    11.5 Evaluating Polynomials  406
    11.6 Correct Rounding of Elementary Functions to binary64  407
      11.6.1 The Table Maker's Dilemma and Ziv's onion peeling strategy  407
      11.6.2 When the TMD is solved  408
      11.6.3 Rounding test  409
      11.6.4 Accurate second step  413
      11.6.5 Error analysis and the accuracy/performance tradeoff  414
    11.7 Computing Error Bounds  415
      11.7.1 The point with efficient code  415
      11.7.2 Example: a "double-double" polynomial evaluation  416
  12 Solving the Table Maker's Dilemma  418
    12.1 Introduction  418
      12.1.1 The Table Maker's Dilemma  419
      12.1.2 Brief history of the TMD  423
      12.1.3 Organization of the chapter  424
    12.2 Preliminary Remarks on the Table Maker's Dilemma  425
      12.2.1 Statistical arguments: what can be expected in practice  425
      12.2.2 In some domains, there is no need to find worst cases  429
      12.2.3 Deducing the worst cases from other functions or domains  432
    12.3 The Table Maker's Dilemma for Algebraic Functions  433
      12.3.1 Algebraic and transcendental numbers and functions  433
      12.3.2 The elementary case of quotients  435
      12.3.3 Around Liouville's theorem  437
      12.3.4 Generating bad rounding cases for the square root using Hensel 2-adic lifting  438
    12.4 Solving the Table Maker's Dilemma for Arbitrary Functions  442
      12.4.1 Lindemann's theorem: application to some transcendental functions  442
      12.4.2 A theorem of Nesterenko and Waldschmidt  443
      12.4.3 A first method: tabulated differences  445
      12.4.4 From the TMD to the distance between a grid and a segment  447
      12.4.5 Linear approximation: Lefèvre's algorithm  449
      12.4.6 The SLZ algorithm  456
      12.4.7 Periodic functions on large arguments  461
    12.5 Some Results  462
      12.5.1 Worst cases for the exponential, logarithmic, trigonometric, and hyperbolic functions  462
      12.5.2 A special case: integer powers  471
    12.6 Current Limits and Perspectives  471

V Extensions  473
  13 Formalisms for Certifying Floating-Point Algorithms  474
    13.1 Formalizing Floating-Point Arithmetic  474
      13.1.1 Defining floating-point numbers  475
      13.1.2 Simplifying the definition  477
      13.1.3 Defining rounding operators  478
      13.1.4 Extending the set of numbers  481
    13.2 Formalisms for Certifying Algorithms by Hand  482
      13.2.1 Hardware units  482
      13.2.2 Low-level algorithms  483
      13.2.3 Advanced algorithms  484
    13.3 Automating Proofs  485
      13.3.1 Computing on bounds  486
      13.3.2 Counting digits  488
      13.3.3 Manipulating expressions  490
      13.3.4 Handling the relative error  494
    13.4 Using Gappa  495
      13.4.1 Toy implementation of sine  495
      13.4.2 Integer division on Itanium  499
  14 Extending the Precision  503
    14.1 Double-Words, Triple-Words…  504
      14.1.1 Double-word arithmetic  505
      14.1.2 Static triple-word arithmetic  508
      14.1.3 Quad-word arithmetic  510
    14.2 Floating-Point Expansions  513
    14.3 Floating-Point Numbers with Batched Additional Exponent  519
    14.4 Large Precision Relying on Processor Integers  520
      14.4.1 Using arbitrary-precision integer arithmetic for arbitrary-precision floating-point arithmetic  522
      14.4.2 A brief introduction to arbitrary-precision integer arithmetic  523

VI Perspectives and Appendix  527
  15 Conclusion and Perspectives  528
  16 Appendix: Number Theory Tools for Floating-Point Arithmetic  529
    16.1 Continued Fractions  529
    16.2 The LLL Algorithm  532

Bibliography  537
Index  574
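As a taste of the techniques catalogued above, the Fast2Sum algorithm (Section 4.3.1) is one of the classical error-free transformations the handbook covers. The sketch below is ours, not taken from the book: assuming binary floating-point with round-to-nearest and |a| ≥ |b|, three operations recover the rounding error of an addition exactly.

```python
def fast2sum(a, b):
    """Fast2Sum (Dekker): assumes |a| >= |b| and round-to-nearest.

    Returns (s, t) where s = fl(a + b) is the rounded sum and t is
    the rounding error, so that a + b = s + t exactly.
    """
    s = a + b      # rounded sum
    z = s - a      # exact under the assumptions: the part of b absorbed into s
    t = b - z      # exact: the part of b that was rounded away
    return s, t

# The error term recovers bits that the rounded sum cannot represent:
s, t = fast2sum(1.0, 2.0**-60)
# s == 1.0 and t == 2.0**-60, so s + t equals 1 + 2^-60 exactly
```

Compensated summation (Section 6.3.2) chains such error terms to sum many values far more accurately than naive accumulation.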




