E-book, English, 698 pages
Barlas Multicore and GPU Programming
1st edition, 2014
ISBN: 978-0-12-417140-4
Publisher: Elsevier Science & Techn.
Format: EPUB
Copy protection: ePub watermark
An Integrated Approach
Multicore and GPU Programming offers broad coverage of the key parallel computing skillsets: multicore CPU programming and manycore 'massively parallel' computing. Using threads, OpenMP, MPI, and CUDA, it teaches the design and development of software capable of taking advantage of today's computing platforms incorporating CPU and GPU hardware and explains how to transition from sequential programming to a parallel computing paradigm. Presenting material refined over more than a decade of teaching parallel computing, author Gerassimos Barlas minimizes the challenge with multiple examples, extensive case studies, and full source code. Using this book, you can develop programs that run over distributed memory machines using MPI, create multi-threaded applications with either libraries or directives, write optimized applications that balance the workload between available computing resources, and profile and debug programs targeting multicore machines.
- Comprehensive coverage of all major multicore programming tools, including threads, OpenMP, MPI, and CUDA
- Demonstrates parallel programming design patterns and examples of how different tools and paradigms can be integrated for superior performance
- Particular focus on the emerging area of divisible load theory and its impact on load balancing and distributed systems
- Download source code, examples, and instructor support materials on the book's companion website
Gerassimos Barlas is a Professor with the Computer Science & Engineering Department, American University of Sharjah, Sharjah, UAE. His research interests include parallel algorithms; development, analysis, and modeling frameworks for load balancing; and distributed Video on-Demand. Prof. Barlas has taught parallel computing for more than 12 years, has been involved with parallel computing since the early 1990s, and is active in the emerging field of Divisible Load Theory for parallel and distributed systems.
Authors/Editors
Further Information & Material
1;Front Cover;1
2;Multicore and GPU Programming: An Integrated Approach;4
3;Copyright;5
4;Dedication;6
5;Contents;8
6;List of Tables;14
7;Preface;16
7.1; What Is in This Book;16
7.2; Using This Book as a Textbook;18
7.3; Software and Hardware Requirements;19
7.4; Sample Code;20
8;Chapter 1: Introduction;22
8.1;1.1 The era of multicore machines;22
8.2;1.2 A taxonomy of parallel machines;24
8.3;1.3 A glimpse of contemporary computing machines;26
8.3.1;1.3.1 The Cell BE processor;27
8.3.2;1.3.2 Nvidia's Kepler;28
8.3.3;1.3.3 AMD's APUs;31
8.3.4;1.3.4 Multicore to many-core: Tilera's TILE-Gx8072 and Intel's Xeon Phi;32
8.4;1.4 Performance metrics;35
8.5;1.5 Predicting and measuring parallel program performance;39
8.5.1;1.5.1 Amdahl's law;42
8.5.2;1.5.2 Gustafson-Barsis's rebuttal;45
8.6; Exercises;46
9;Chapter 2: Multicore and parallel program design;48
9.1;2.1 Introduction;48
9.2;2.2 The PCAM methodology;49
9.3;2.3 Decomposition patterns;53
9.3.1;2.3.1 Task parallelism;54
9.3.2;2.3.2 Divide-and-conquer decomposition;55
9.3.3;2.3.3 Geometric decomposition;57
9.3.4;2.3.4 Recursive data decomposition;60
9.3.5;2.3.5 Pipeline decomposition;63
9.3.6;2.3.6 Event-based coordination decomposition;67
9.4;2.4 Program structure patterns;68
9.4.1;2.4.1 Single-program, multiple-data;69
9.4.2;2.4.2 Multiple-program, multiple-data;69
9.4.3;2.4.3 Master-worker;70
9.4.4;2.4.4 Map-reduce;71
9.4.5;2.4.5 Fork/join;72
9.4.6;2.4.6 Loop parallelism;74
9.5;2.5 Matching decomposition patterns with program structure patterns;74
9.6; Exercises;75
10;Chapter 3: Shared-memory programming: threads;76
10.1;3.1 Introduction;76
10.2;3.2 Threads;79
10.2.1;3.2.1 What is a thread?;79
10.2.2;3.2.2 What are threads good for?;80
10.2.3;3.2.3 Thread creation and initialization;80
10.2.3.1;3.2.3.1 Implicit thread creation;84
10.2.4;3.2.4 Sharing data between threads;86
10.3;3.3 Design concerns;89
10.4;3.4 Semaphores;91
10.5;3.5 Applying semaphores in classical problems;96
10.5.1;3.5.1 Producers-consumers;96
10.5.2;3.5.2 Dealing with termination;100
10.5.2.1;3.5.2.1 Termination using a shared data item;100
10.5.2.2;3.5.2.2 Termination using messages;106
10.5.3;3.5.3 The barbershop problem: introducing fairness;111
10.5.4;3.5.4 Readers-writers;116
10.5.4.1;3.5.4.1 A solution favoring the readers;116
10.5.4.2;3.5.4.2 Giving priority to the writers;117
10.5.4.3;3.5.4.3 A fair solution;119
10.6;3.6 Monitors;120
10.6.1;3.6.1 Design approach 1: critical section inside the monitor;124
10.6.2;3.6.2 Design approach 2: monitor controls entry to critical section;125
10.7;3.7 Applying monitors in classical problems;128
10.7.1;3.7.1 Producers-consumers revisited;128
10.7.1.1;3.7.1.1 Producers-consumers: buffer manipulation within the monitor;128
10.7.1.2;3.7.1.2 Producers-consumers: buffer insertion/extraction exterior to the monitor;131
10.7.2;3.7.2 Readers-writers;134
10.7.2.1;3.7.2.1 A solution favoring the readers;134
10.7.2.2;3.7.2.2 Giving priority to the writers;136
10.7.2.3;3.7.2.3 A fair solution;137
10.8;3.8 Dynamic vs. static thread management;141
10.8.1;3.8.1 Qt's thread pool;141
10.8.2;3.8.2 Creating and managing a pool of threads;142
10.9;3.9 Debugging multithreaded applications;151
10.10;3.10 Higher-level constructs: multithreaded programming without threads;156
10.10.1;3.10.1 Concurrent map;157
10.10.2;3.10.2 Map-reduce;159
10.10.3;3.10.3 Concurrent filter;161
10.10.4;3.10.4 Filter-reduce;163
10.10.5;3.10.5 A case study: multithreaded sorting;164
10.10.6;3.10.6 A case study: multithreaded image matching;173
10.11; Exercises;180
11;Chapter 4: Shared-memory programming: OpenMP;186
11.1;4.1 Introduction;186
11.2;4.2 Your First OpenMP Program;187
11.3;4.3 Variable Scope;190
11.3.1;4.3.1 OpenMP Integration V.0: Manual Partitioning;192
11.3.2;4.3.2 OpenMP Integration V.1: Manual Partitioning Without a Race Condition;194
11.3.3;4.3.3 OpenMP Integration V.2: Implicit Partitioning with Locking;196
11.3.4;4.3.4 OpenMP Integration V.3: Implicit Partitioning with Reduction;197
11.3.5;4.3.5 Final Words on Variable Scope;199
11.4;4.4 Loop-Level Parallelism;200
11.4.1;4.4.1 Data Dependencies;202
11.4.1.1;4.4.1.1 Flow Dependencies;204
11.4.1.2;4.4.1.2 Antidependencies;211
11.4.1.3;4.4.1.3 Output Dependencies;211
11.4.2;4.4.2 Nested Loops;212
11.4.3;4.4.3 Scheduling;213
11.5;4.5 Task Parallelism;216
11.5.1;4.5.1 The sections Directive;217
11.5.1.1;4.5.1.1 Producers-Consumers in OpenMP;218
11.5.2;4.5.2 The task Directive;223
11.6;4.6 Synchronization Constructs;229
11.7;4.7 Correctness and Optimization Issues;237
11.7.1;4.7.1 Thread Safety;237
11.7.2;4.7.2 False Sharing;241
11.8;4.8 A Case Study: Sorting in OpenMP;247
11.8.1;4.8.1 Bottom-Up Mergesort in OpenMP;248
11.8.2;4.8.2 Top-Down Mergesort in OpenMP;251
11.8.3;4.8.3 Performance Comparison;256
11.9; Exercises;257
12;Chapter 5: Distributed memory programming;260
12.1;5.1 Communicating Processes;260
12.2;5.2 MPI;261
12.3;5.3 Core concepts;262
12.4;5.4 Your first MPI program;263
12.5;5.5 Program architecture;267
12.5.1;5.5.1 SPMD;267
12.5.2;5.5.2 MPMD;267
12.6;5.6 Point-to-Point communication;269
12.7;5.7 Alternative Point-to-Point communication modes;273
12.7.1;5.7.1 Buffered Communications;274
12.8;5.8 Non-blocking communications;276
12.9;5.9 Point-to-Point Communications: Summary;280
12.10;5.10 Error reporting and handling;280
12.11;5.11 Collective communications;282
12.11.1;5.11.1 Scattering;287
12.11.2;5.11.2 Gathering;293
12.11.3;5.11.3 Reduction;295
12.11.4;5.11.4 All-to-All Gathering;300
12.11.5;5.11.5 All-to-All Scattering;304
12.11.6;5.11.6 All-to-All Reduction;309
12.11.7;5.11.7 Global Synchronization;310
12.12;5.12 Communicating objects;310
12.12.1;5.12.1 Derived Datatypes;311
12.12.2;5.12.2 Packing/Unpacking;318
12.13;5.13 Node management: communicators and groups;321
12.13.1;5.13.1 Creating Groups;321
12.13.2;5.13.2 Creating Intra-Communicators;323
12.14;5.14 One-sided communications;326
12.14.1;5.14.1 RMA Communication Functions;328
12.14.2;5.14.2 RMA Synchronization Functions;329
12.15;5.15 I/O considerations;338
12.16;5.16 Combining MPI processes with threads;346
12.17;5.17 Timing and Performance Measurements;349
12.18;5.18 Debugging and profiling MPI programs;350
12.19;5.19 The Boost.MPI library;354
12.19.1;5.19.1 Blocking and Non-Blocking Communications;356
12.19.2;5.19.2 Data Serialization;361
12.19.3;5.19.3 Collective Operations;364
12.20;5.20 A case study: diffusion-limited aggregation;368
12.21;5.21 A case study: brute-force encryption cracking;373
12.21.1;5.21.1 Version #1: "plain-vanilla" MPI;373
12.21.2;5.21.2 Version #2: combining MPI and OpenMP;379
12.22;5.22 A Case Study: MPI Implementation of the Master-Worker Pattern;383
12.22.1;5.22.1 A Simple Master-Worker Setup;384
12.22.2;5.22.2 A Multithreaded Master-Worker Setup;392
12.23; Exercises;407
13;Chapter 6: GPU programming;412
13.1;6.1 GPU Programming;412
13.2;6.2 CUDA's programming model: threads, blocks, and grids;415
13.3;6.3 CUDA's execution model: streaming multiprocessors and warps;421
13.4;6.4 CUDA compilation process;424
13.5;6.5 Putting together a CUDA project;428
13.6;6.6 Memory hierarchy;431
13.6.1;6.6.1 Local Memory/Registers;437
13.6.2;6.6.2 Shared Memory;438
13.6.3;6.6.3 Constant Memory;446
13.6.4;6.6.4 Texture and Surface Memory;453
13.7;6.7 Optimization techniques;453
13.7.1;6.7.1 Block and Grid Design;453
13.7.2;6.7.2 Kernel Structure;463
13.7.3;6.7.3 Shared Memory Access;467
13.7.4;6.7.4 Global Memory Access;475
13.7.5;6.7.5 Page-Locked and Zero-Copy Memory;479
13.7.6;6.7.6 Unified Memory;482
13.7.7;6.7.7 Asynchronous Execution and Streams;485
13.7.7.1;6.7.7.1 Stream Synchronization: Events and Callbacks;488
13.8;6.8 Dynamic parallelism;492
13.9;6.9 Debugging CUDA programs;496
13.10;6.10 Profiling CUDA programs;497
13.11;6.11 CUDA and MPI;501
13.12;6.12 Case studies;506
13.12.1;6.12.1 Fractal Set Calculation;507
13.12.1.1;6.12.1.1 Version #1: One thread per pixel;508
13.12.1.2;6.12.1.2 Version #2: Pinned host and pitched device memory;511
13.12.1.3;6.12.1.3 Version #3: Multiple pixels per thread;513
13.12.1.4;6.12.1.4 Evaluation;515
13.12.2;6.12.2 Block Cipher Encryption;517
13.12.2.1;6.12.2.1 Version #1: The case of a standalone GPU machine;524
13.12.2.2;6.12.2.2 Version #2: Overlapping GPU communication and computation;531
13.12.2.3;6.12.2.3 Version #3: Using a cluster of GPU machines;533
13.12.2.4;6.12.2.4 Evaluation;540
13.13; Exercises;544
14;Chapter 7: The Thrust template library;548
14.1;7.1 Introduction;548
14.2;7.2 First steps in Thrust;549
14.3;7.3 Working with Thrust datatypes;553
14.4;7.4 Thrust algorithms;556
14.4.1;7.4.1 Transformations;557
14.4.2;7.4.2 Sorting and searching;561
14.4.3;7.4.3 Reductions;567
14.4.4;7.4.4 Scans/prefix sums;569
14.4.5;7.4.5 Data management and manipulation;571
14.5;7.5 Fancy iterators;574
14.6;7.6 Switching device back ends;580
14.7;7.7 Case studies;582
14.7.1;7.7.1 Monte Carlo integration;582
14.7.2;7.7.2 DNA sequence alignment;585
14.8; Exercises;592
15;Chapter 8: Load balancing;596
15.1;8.1 Introduction;596
15.2;8.2 Dynamic load balancing: the Linda legacy;597
15.3;8.3 Static Load Balancing: The Divisible Load Theory Approach;599
15.3.1;8.3.1 Modeling Costs;600
15.3.2;8.3.2 Communication Configuration;607
15.3.3;8.3.3 Analysis;610
15.3.3.1;8.3.3.1 N-Port, Block-Type, Single-Installment Solution;611
15.3.3.2;8.3.3.2 One-Port, Block-Type, Single-Installment Solution;613
15.3.4;8.3.4 Summary - Short Literature Review;619
15.4;8.4 DLTlib: A library for partitioning workloads;622
15.5;8.5 Case studies;625
15.5.1;8.5.1 Hybrid Computation of a Mandelbrot Set "Movie": A Case Study in Dynamic Load Balancing;625
15.5.2;8.5.2 Distributed Block Cipher Encryption: A Case Study in Static Load Balancing;638
16;Appendix A: Compiling Qt programs;650
16.1;A.1 Using an IDE;650
16.2;A.2 The qmake Utility;650
17;Appendix B: Running MPI programs;652
17.1;B.1 Preparatory Steps;652
17.2;B.2 Computing Node Discovery for MPI Program Deployment;653
17.2.1;B.2.1 Host Discovery with the nmap Utility;653
17.2.2;B.2.2 Automatic Generation of a Hostfile;654
18;Appendix C: Time measurement;656
18.1;C.1 Introduction;656
18.2;C.2 POSIX High-Resolution Timing;656
18.3;C.3 Timing in Qt;658
18.4;C.4 Timing in OpenMP;659
18.5;C.5 Timing in MPI;659
18.6;C.6 Timing in CUDA;659
19;Appendix D: Boost.MPI;662
19.1;D.1 Mapping from MPI C to Boost.MPI;662
20;Appendix E: Setting up CUDA;664
20.1;E.1 Installation;664
20.2;E.2 Issues with GCC;664
20.3;E.3 Running CUDA without an Nvidia GPU;665
20.4;E.4 Running CUDA on Optimus-Equipped Laptops;666
20.5;E.5 Combining CUDA with Third-Party Libraries;667
21;Appendix F: DLTlib;670
21.1;F.1 DLTlib Functions;670
21.1.1;F.1.1 Class Network: Generic Methods;671
21.1.2;F.1.2 Class Network: Query Processing;673
21.1.3;F.1.3 Class Network: Image Processing;674
21.1.4;F.1.4 Class Network: Image Registration;675
21.2;F.2 DLTlib Files;678
22;Glossary;680
23;Bibliography;682
24;Index;686
Preface
Parallel computing has been given a fresh breath of life since the emergence of multicore architectures in the first decade of the new century. The new platforms demand a new approach to software development, one that blends the tools and established practices of the network-of-workstations era with emerging software platforms such as CUDA. This book tries to address this need by covering the dominant contemporary tools and techniques, both in isolation and, most importantly, in combination with each other. We strive to provide examples where multiple platforms and programming paradigms (e.g., message passing & threads) are effectively combined. “Hybrid” computation, as it is usually called, is a new trend in high-performance computing, one that could possibly allow software to scale to the “millions of threads” required for exascale performance.

All chapters are accompanied by extensive examples and practice problems with an emphasis on putting them to work, while comparing alternative design scenarios. All the little details, which can make the difference between productive software development and a stressful exercise in futility, are presented in an orderly fashion. The book covers the latest advances in tools that have been inherited from the 1990s (e.g., the OpenMP and MPI standards), but also more cutting-edge platforms, such as the Qt library with its sophisticated thread management and the Thrust template library with its capability to deploy the same software over diverse multicore architectures, including both CPUs and Graphics Processing Units (GPUs).

We could never accomplish the feat of covering all the tools available for multicore development today. Even some of the industry-standard ones, like POSIX threads, are omitted. Our goal is both to sample the dominant paradigms (ranging from OpenMP’s semi-automatic parallelization of sequential code to the explicit communication “plumbing” that underpins MPI) and to explain the rationale and the how-to behind efficient multicore program development.

What Is in This Book
This book can be separated into the following logical units, although no such distinction is made in the text:
• Introduction, designing multicore software: Chapter 1 introduces multicore hardware and examines influential instances of this architectural paradigm. Chapter 1 also introduces speedup and efficiency, which are essential metrics used in the evaluation of multicore and parallel software. Amdahl's law and Gustafson-Barsis's rebuttal cap the chapter, providing estimates of what can be expected from the exciting new developments in multicore and many-core hardware.
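As a quick reference for the two laws just mentioned, here is a minimal sketch in standard notation (a generic formulation, not the book's own derivation). If a fraction f of a program's sequential execution time can be parallelized over N processors, Amdahl's law bounds the achievable speedup:

    S(N) = \frac{1}{(1 - f) + f/N} \le \frac{1}{1 - f}

Gustafson-Barsis instead measures the parallel fraction f' on the N-processor run itself, where the problem size is allowed to grow with N, and predicts a scaled speedup that is not capped by a constant:

    S'(N) = (1 - f') + f' \, N

Note that f and f' are measured against different baselines (the sequential run and the parallel run, respectively), which is the crux of the "rebuttal".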
Chapter 2 is all about the methodology and the design patterns that can be employed in the development of parallel and multicore software. Both work decomposition patterns and program structure patterns are examined.
• Shared-memory programming: Two different approaches for shared-memory parallel programming are examined: explicit and implicit parallelization. On the explicit side, Chapter 3 covers threads and two of the most commonly used synchronization mechanisms: semaphores and monitors. Frequently encountered design patterns, such as producers-consumers and readers-writers, are explained thoroughly and applied in a range of examples. On the implicit side, Chapter 4 covers the OpenMP standard, which has been specifically designed for parallelizing existing sequential code with minimum effort, significantly reducing development time. There are still complications, such as loop-carried dependencies, which are also addressed.
• Distributed memory programming: Chapter 5 introduces the de facto standard for distributed memory parallel programming, i.e., the Message Passing Interface (MPI). MPI is relevant to multicore programming because it is designed to scale from a shared-memory multicore machine to a million-node supercomputer. As such, MPI provides the foundation for utilizing multiple disjoint multicore machines as a single virtual platform.
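To make the implicit, directive-based style of OpenMP concrete, here is a minimal, hedged sketch (a generic example, not taken from the book): a sequential summation loop is parallelized with a single directive, and the loop-carried dependency on the accumulator is resolved with a reduction clause.

#include <cstdio>
#include <vector>

int main()
{
    std::vector<double> data(1000000, 0.5);
    double sum = 0.0;

    // Each thread accumulates a private partial sum; the partial sums are
    // combined at the end of the loop, avoiding a data race on 'sum'.
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < static_cast<long>(data.size()); ++i)
        sum += data[i];

    std::printf("sum = %f\n", sum);
    return 0;
}

Built with an OpenMP-capable compiler (e.g., g++ -fopenmp), the loop body is unchanged from its sequential form. By contrast, a minimal point-to-point MPI exchange, again only a generic sketch assuming a standard MPI installation, shows the explicit message-passing style whose features are enumerated next:

#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        int payload = 42;
        // Blocking send of one integer to rank 1, message tag 0.
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1)
    {
        int payload = 0;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("Rank 1 received %d\n", payload);
    }

    MPI_Finalize();
    return 0;
}

Launched with at least two processes (e.g., mpirun -np 2), rank 0 sends a value that rank 1 receives and prints; the variable names are purely illustrative.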
The features that are covered include point-to-point, collective, and one-sided communication. A section is dedicated to the Boost.MPI library, as it simplifies the use of MPI, although it is not yet feature-complete.
• GPU programming: GPUs are one of the primary reasons why this book was put together. In a similar fashion to shared-memory programming, we examine the problem of developing GPU-specific software from two perspectives: on the one hand, we have the "nuts-and-bolts" approach of Nvidia's CUDA, where memory transfers, data placement, and thread execution configuration have to be carefully planned. CUDA is examined in Chapter 6.
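As a hedged illustration of that low-level style (a generic sketch, not one of the book's case studies; the kernel name and sizes are arbitrary), the following CUDA fragment shows the explicit device allocations, host-device transfers, and execution configuration the text refers to:

#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread scales one array element.
__global__ void scale(float *d, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] *= factor;
}

int main()
{
    const int N = 1 << 20;
    float *h = new float[N];
    for (int i = 0; i < N; ++i)
        h[i] = 1.0f;

    float *d = nullptr;
    cudaMalloc(&d, N * sizeof(float));                            // device allocation
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);  // explicit transfer

    int block = 256;                      // threads per block
    int grid = (N + block - 1) / block;   // enough blocks to cover N elements
    scale<<<grid, block>>>(d, 2.0f, N);   // execution configuration

    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("h[0] = %f\n", h[0]);

    cudaFree(d);
    delete[] h;
    return 0;
}

None of these steps are automatic: block and grid sizes, transfer direction, and data placement are all the programmer's responsibility, which is exactly the planning burden mentioned above.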
On the other hand, we have the high-level, algorithmic approach of the Thrust template library, which is covered in Chapter 7. The STL-like approach to program design affords Thrust the ability to target both CPU and GPU platforms, a unique feature among the tools we cover (a brief sketch appears at the end of this overview).
• Load balancing: Chapter 8 is dedicated to an often underestimated aspect of multicore development. In general, load balancing has to be seriously considered once heterogeneous computing resources come into play. For example, a CPU and a GPU constitute such a set of resources, so we should not think only of clusters of dissimilar machines as fitting this requirement. Chapter 8 briefly discusses the Linda coordination language, which can be considered a high-level abstraction of dynamic load balancing.
The main focus is on static load balancing and the mathematical models that can be used to drive load partitioning and data communication sequences.
A well-established methodology known as Divisible Load Theory (DLT) is explained and applied in a number of scenarios. A simple C++ library that implements parts of the DLT research results, which have been published over the past two decades, is also presented.
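To hint at the flavor of these models, here is a minimal sketch in generic Divisible Load Theory style (ignoring communication costs, which the book's models do account for): a divisible workload W is split into fractions \alpha_i among N workers, where worker i needs w_i time units per unit of load. Equalizing the workers' finish times,

    \alpha_i \, w_i \, W = \alpha_j \, w_j \, W \quad \text{for all } i, j, \qquad \sum_{i=1}^{N} \alpha_i = 1

yields the closed-form partition

    \alpha_i = \frac{1/w_i}{\sum_{j=1}^{N} 1/w_j}

so that faster workers receive proportionally larger shares. Finally, returning to the Thrust library mentioned earlier in this overview, the following hedged sketch (again a generic example, not one of the book's case studies) illustrates its STL-like style; depending on the back end selected at build time, the same code runs on a GPU or on multicore CPUs:

#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>

// Unary functor squaring its argument; usable on both host and device.
struct square
{
    __host__ __device__ float operator()(float x) const { return x * x; }
};

int main()
{
    thrust::device_vector<float> v(1 << 20, 2.0f);   // data resides on the device

    // Sum of squares computed by the selected back end (CUDA, OpenMP, or TBB).
    float result = thrust::transform_reduce(v.begin(), v.end(),
                                            square(), 0.0f,
                                            thrust::plus<float>());

    std::printf("sum of squares = %f\n", result);
    return 0;
}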
Using This Book as a Textbook

The material covered in this book is appropriate for senior undergraduate or postgraduate course work. The required student background includes programming in C and C++ (both languages are used throughout this book), basic operating system concepts, and at least elementary knowledge of computer architecture. Depending on the desired focus, an instructor may choose to follow one of the suggested paths listed below. The first two chapters lay the foundations for the other chapters, so they are included in all sequences:
• Emphasis on parallel programming (undergraduate):
  • Chapter 1: Flynn's taxonomy, contemporary multicore machines, performance metrics. Sections 1.1–1.5.
  • Chapter 2: Design, PCAM methodology, decomposition patterns, program structure patterns. Sections 2.1–2.5.
  • Chapter 3: Threads, semaphores, monitors. Sections 3.1–3.7.
  • Chapter 4: OpenMP basics, work-sharing constructs. Sections 4.1–4.8.
  • Chapter 5: MPI, point-to-point communications, collective operations, object/structure communications, debugging and profiling. Sections 5.1–5.12, 5.15–5.18, 5.20.
  • Chapter 6: CUDA programming model, memory hierarchy, GPU-specific optimizations. Sections 6.1–6.2, 6.7.1, 6.7.3, 6.7.6, 6.9–6.11, 6.12.1.
  • Chapter 7: Thrust basics. Sections 7.1–7.4.
  • Chapter 8: Load balancing. Sections 8.1–8.3.
• Emphasis on multicore programming (undergraduate):
  • Chapter 1: Flynn's taxonomy, contemporary multicore machines, performance metrics. Sections 1.1–1.5.
  • Chapter 2: Design, PCAM methodology, decomposition patterns, program structure patterns. Sections 2.1–2.5.
  • Chapter 3: Threads, semaphores, monitors. Sections 3.1–3.7.
  • Chapter 4: OpenMP basics, work-sharing constructs, correctness and performance issues. Sections 4.1–4.4.
  • Chapter 5: MPI, point-to-point communications, collective operations, debugging and profiling. Sections 5.1–5.12, 5.16–5.18, 5.21.
  • Chapter 6: CUDA programming model, memory hierarchy, GPU-specific optimizations. Sections 6.1–6.10,...