E-book, English, 406 pages
Catthoor / Raghavan / Lambrechts Ultra-Low Energy Domain-Specific Instruction-Set Processors
1st edition, 2010
ISBN: 978-90-481-9528-2
Publisher: Springer-Verlag
Format: PDF
Copy protection: Adobe DRM
Modern consumers carry many electronic devices, such as a mobile phone, digital camera, GPS receiver, PDA and MP3 player. The functionality of each of these devices has gone through an important evolution over recent years, with a steep increase both in the number of features and in the quality of the services they provide. However, providing the compute power required to support (an uncompromised combination of) all this functionality is highly non-trivial. Designing processors that meet the demanding requirements of future mobile devices requires optimization of the embedded system in general and of the embedded processors in particular, as they should strike the right balance between flexibility, energy efficiency and performance. In general, a designer will try to minimize the energy consumption (as far as needed) for a given performance, while retaining sufficient flexibility. Achieving this goal is complex enough when looking at the processor in isolation, but in reality the processor is a single component in a more complex system. To design such a complex system successfully, critical decisions during the design of each individual component should take into account their effect on the other parts, with the clear goal of moving to a global Pareto optimum in the complete multi-dimensional exploration space. Within this complex, global design of battery-operated embedded systems, Ultra-Low Energy Domain-Specific Instruction-Set Processors focuses on the energy-aware architecture exploration of domain-specific instruction-set processors and on the co-optimization of the datapath architecture, foreground memory and instruction memory organisation, with a link to the required mapping techniques or compiler steps at the early stages of the design.
An extensive energy-breakdown experiment for a complete embedded platform identifies both the energy and the performance bottlenecks, together with the important relations between the different components. Based on this knowledge, architecture extensions are proposed for each of these bottlenecks.
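The goal of moving to a global Pareto optimum in a multi-dimensional exploration space can be made concrete with a minimal sketch (not taken from the book; the function name, design points and units are hypothetical): given candidate architectures scored on energy and delay, keep only the points that no other candidate beats on both axes.

```python
# Illustrative sketch: Pareto-front filtering over a two-dimensional
# (energy, delay) design space, where lower is better on both axes.
# All design points below are hypothetical.

def pareto_front(points):
    """Return the points not dominated by any other point.

    A point q dominates p when q is no worse than p on both axes
    and strictly better on at least one.
    """
    front = []
    for p in points:
        dominated = any(
            q != p
            and q[0] <= p[0] and q[1] <= p[1]
            and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical candidates: (energy in mJ, delay in ms) per architecture.
designs = [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0), (9.0, 9.0), (8.0, 6.0)]
print(pareto_front(designs))  # keeps (10.0, 5.0), (12.0, 4.0), (8.0, 6.0)
```

Here (8.0, 7.0) is dropped because (8.0, 6.0) is equally good on energy and strictly better on delay; the surviving points are the trade-off curve a designer would explore further.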
Further Information & Material
1;Preface;5
2;Contents;8
3;Glossary and Acronyms;17
4;Chapter 1: Introduction;20
4.1;1.1 Context;20
4.1.1;1.1.1 Processor design: a game of many trade-offs;22
4.1.2;1.1.2 High level trade-off target;25
4.2;1.2 Focus of this book;26
4.3;1.3 Overview of the main differentiating elements;29
4.4;1.4 Structure of this book;32
5;Chapter 2: Global State-of-the-Art Overview;35
5.1;2.1 Architectural components and mapping;36
5.1.1;2.1.1 Processor core;36
5.1.1.1;2.1.1.1 The FUs, slots and PEs of the datapath;37
5.1.1.2;2.1.1.2 Foreground memory (or register files);38
5.1.1.3;2.1.1.3 Processor pipelining;39
5.1.1.4;2.1.1.4 Issue logic;40
5.1.1.5;2.1.1.5 Overview of state-of-the-art processor classes;40
5.1.2;2.1.2 Data memory hierarchy;43
5.1.3;2.1.3 Instruction/configuration memory organization;43
5.1.4;2.1.4 Inter-core communication architecture;44
5.2;2.2 Platform architecture exploration;44
5.2.1;2.2.1 Exploration strategy;45
5.2.2;2.2.2 Criteria/cost metric;46
5.2.2.1;2.2.2.1 Performance;46
5.2.2.2;2.2.2.2 Energy consumption;46
5.2.2.3;2.2.2.3 Area;48
5.2.2.4;2.2.2.4 Design effort;48
5.2.2.5;2.2.2.5 Flexibility;49
5.2.3;2.2.3 Evaluation method;49
5.3;2.3 Conclusion and key messages of this chapter;50
6;Chapter 3: Energy Consumption Breakdown and Requirements for an Embedded Platform;51
6.1;3.1 Platform view: a processor is part of a system;52
6.2;3.2 A video platform case study;53
6.2.1;3.2.1 Video encoder/decoder description and context;54
6.2.1.1;3.2.1.1 Driver application;54
6.2.1.2;3.2.1.2 Embedded platform description;55
6.2.1.3;3.2.1.3 Inter-tile communication architecture;56
6.2.1.4;3.2.1.4 Mapping the application to the architecture;57
6.2.2;3.2.2 Experimental results for platform components;58
6.2.2.1;3.2.2.1 Experimental procedure;58
6.2.2.2;3.2.2.2 Embedded processor datapath logic;59
6.2.2.3;3.2.2.3 Datapath pipeline registers;60
6.2.2.4;3.2.2.4 Data and instruction memory hierarchy;61
6.2.2.5;3.2.2.5 Inter-tile communication architecture;62
6.2.3;3.2.3 Power breakdown analysis;63
6.2.4;3.2.4 Conclusions for the platform case study;67
6.3;3.3 Embedded processor case study;67
6.3.1;3.3.1 Scope of the case study;68
6.3.2;3.3.2 Processor styles;69
6.3.2.1;3.3.2.1 Software pipelining (e.g. modulo scheduling);70
6.3.2.2;3.3.2.2 Clustering (clustered VLIW);70
6.3.2.3;3.3.2.3 Coarse-grained reconfigurable architecture;71
6.3.2.4;3.3.2.4 SIMD or sub-word parallelism;73
6.3.2.5;3.3.2.5 Custom instructions and/or FUs;73
6.3.2.6;3.3.2.6 Optimized data memory hierarchy;74
6.3.2.7;3.3.2.7 Hybrid combinations;74
6.3.3;3.3.3 Focus of the experiments;74
6.3.4;3.3.4 Experimental results for the processor case study;75
6.3.4.1;3.3.4.1 RISC;76
6.3.4.2;3.3.4.2 Centralized VLIW;76
6.3.4.3;3.3.4.3 Clustered VLIW;78
6.3.4.4;3.3.4.4 Coarse-grained architectures;80
6.3.5;3.3.5 Conclusions for the processor case study;81
6.4;3.4 High level architecture requirements;81
6.5;3.5 Architecture exploration and trends;83
6.5.1;3.5.1 Interconnect scaling in future technologies;83
6.5.2;3.5.2 Representative architecture exploration examples: What are the bottlenecks?;84
6.6;3.6 Architecture optimization of different platform components;86
6.6.1;3.6.1 Algorithm design;86
6.6.2;3.6.2 Data memory hierarchy;87
6.6.3;3.6.3 Foreground memory organization;88
6.6.4;3.6.4 Instruction/Configuration Memory Organization (ICMO);91
6.6.5;3.6.5 Datapath parallelism;93
6.6.6;3.6.6 Datapath–address path;95
6.7;3.7 Putting it together: FEENECS template;96
6.8;3.8 Comparison to related work;98
6.9;3.9 Conclusions and key messages of this chapter;98
7;Chapter 4: Overall Framework for Exploration;100
7.1;4.1 Introduction and motivation;100
7.2;4.2 Compiler and simulator flow;103
7.2.1;4.2.1 Memory architecture subsystem;104
7.2.1.1;4.2.1.1 Data memory hierarchy;105
7.2.1.2;4.2.1.2 Instruction/Configuration Memory Organization/Hierarchy (ICMO);106
7.2.2;4.2.2 Processor core subsystem;107
7.2.2.1;4.2.2.1 Processor datapath;107
7.2.2.2;4.2.2.2 Register File/Foreground Memory Organization;108
7.2.3;4.2.3 Platform dependent loop transformations;110
7.3;4.3 Energy estimation flow (power model);111
7.4;4.4 Comparison to related work;113
7.5;4.5 Architecture exploration for various algorithms;116
7.5.1;4.5.1 Exploration space of key parameters;116
7.5.2;4.5.2 Trends in exploration space;118
7.5.2.1;4.5.2.1 IPC trends;126
7.5.2.2;4.5.2.2 Loop buffers and their impact on Instruction Memory Hierarchy/Organization;127
7.5.2.3;4.5.2.3 Exploration time;129
7.6;4.6 Conclusion and key messages of this chapter;130
8;Chapter 5: Clustered L0 (Loop) Buffer Organization and Combination with Data Clusters;131
8.1;5.1 Introduction and motivation;132
8.2;5.2 Distributed L0 buffer organization;132
8.2.1;5.2.1 Filling distributed L0 buffers;134
8.2.2;5.2.2 Regulating access;135
8.2.3;5.2.3 Indexing into L0 buffer partitions;136
8.2.4;5.2.4 Fetching from L0 buffers or L1 cache;137
8.3;5.3 An illustration;137
8.4;5.4 Architectural evaluation;139
8.4.1;5.4.1 Energy reduction due to clustering;141
8.4.2;5.4.2 Proposed organization versus centralized organizations;144
8.4.3;5.4.3 Performance issues;146
8.5;5.5 Comparison to related work;147
8.6;5.6 Combining L0 instruction and data clusters;149
8.6.1;5.6.1 Data clustering;150
8.6.2;5.6.2 Data clustering followed by L0 clustering;151
8.6.3;5.6.3 Simulation results;152
8.6.4;5.6.4 VLIW Variants;155
8.7;5.7 Conclusions and key messages of this chapter;156
9;Chapter 6: Multi-threading in Uni-threaded Processor;158
9.1;6.1 Introduction;158
9.2;6.2 Need for light weight multi-threading;161
9.3;6.3 Proposed multi-threading architecture;164
9.3.1;6.3.1 Extending a uni-processor for multi-threading;164
9.3.1.1;6.3.1.1 Software counter based loop controller;165
9.3.1.2;6.3.1.2 Hardware counter based loop controller;166
9.3.1.3;6.3.1.3 Running multiple loops in parallel;167
9.4;6.4 Compilation support potential;169
9.5;6.5 Comparison to related work;171
9.6;6.6 Experimental results;175
9.6.1;6.6.1 Experimental platform setup;175
9.6.2;6.6.2 Benchmarks and base architectures used;176
9.6.3;6.6.3 Energy and performance analysis;177
9.7;6.7 Conclusion and key messages of this chapter;180
10;Chapter 7: Handling Irregular Indexed Arrays and Dynamically Accessed Data on Scratchpad Memory Organisations;181
10.1;7.1 Introduction;182
10.2;7.2 Motivating example for irregular indexing;183
10.3;7.3 Related work on irregular indexed array handling;184
10.4;7.4 Regular and irregular arrays;185
10.5;7.5 Cost model for data transfer;186
10.6;7.6 SPM mapping algorithm;187
10.6.1;7.6.1 Illustrating example;187
10.6.2;7.6.2 Search-space exploration algorithm;188
10.7;7.7 Experiments and results;191
10.8;7.8 Handling dynamic data structures on scratchpad memory organisations;193
10.9;7.9 Related work on dynamic data structure access;194
10.10;7.10 Dynamic referencing: locality optimization;195
10.10.1;7.10.1 Independent reference model;197
10.10.2;7.10.2 Comparison of DM-cache with SPM;199
10.10.3;7.10.3 Optimal mapping on SPM: results;201
10.11;7.11 Dynamic organization: locality optimization;205
10.11.1;7.11.1 MST using binary heap;206
10.11.2;7.11.2 Ultra dynamic data organization;207
10.12;7.12 Conclusion and key messages of this chapter;211
11;Chapter 8: An Asymmetrical Register File: The VWR;213
11.1;8.1 Introduction;213
11.2;8.2 High level motivation;217
11.3;8.3 Proposed micro-architecture of VWR;218
11.3.1;8.3.1 Data (background) memory organization and interface;219
11.3.2;8.3.2 Foreground memory organization;220
11.3.3;8.3.3 Connectivity between VWR and datapath;222
11.3.4;8.3.4 Layout aspects of VWR in a standard-cell based design;223
11.3.5;8.3.5 Custom design circuit/micro-architecture and layout;225
11.4;8.4 VWR operation;228
11.5;8.5 Comparison to related work;231
11.6;8.6 Experimental results on DSP benchmarks;233
11.6.1;8.6.1 Experimental setup;233
11.6.2;8.6.2 Benchmarks and energy savings;234
11.7;8.7 Conclusion and key messages of this chapter;236
12;Chapter 9: Exploiting Word-Width Information During Mapping;237
12.1;9.1 Word-width variation in applications;237
12.1.1;9.1.1 Fixed point refinement;239
12.1.2;9.1.2 Word-width variation in applications;243
12.2;9.2 Word-width aware energy models;244
12.2.1;9.2.1 Varying word-width or dynamic range;245
12.2.2;9.2.2 Use-cases for word-width aware energy models;246
12.2.3;9.2.3 Example of word-width aware energy estimation;247
12.3;9.3 Exploiting word-width variation in mapping;248
12.3.1;9.3.1 Assignment;249
12.3.1.1;9.3.1.1 Concept;249
12.3.1.2;9.3.1.2 Expected gains;250
12.3.2;9.3.2 Scheduling;251
12.3.2.1;9.3.2.1 Concept;251
12.3.2.2;9.3.2.2 Expected gains;252
12.3.3;9.3.3 ISA selection;255
12.3.3.1;9.3.3.1 Concept;255
12.3.3.2;9.3.3.2 Expected gains;255
12.3.4;9.3.4 Data parallelization;256
12.3.4.1;9.3.4.1 Concept;257
12.3.4.2;9.3.4.2 Expected gains;258
12.4;9.4 Software SIMD;259
12.4.1;9.4.1 Hardware SIMD vs Software SIMD;259
12.4.2;9.4.2 Enabling SIMD without hardware separation;261
12.4.2.1;9.4.2.1 Corrective operations to preserve data boundaries;262
12.4.2.2;9.4.2.2 Software SIMD on a Hardware SIMD capable datapath;273
12.4.3;9.4.3 Case study 1: Homogeneous Software SIMD exploration for a Hardware SIMD capable RISC;273
12.4.4;9.4.4 Case study 2: Software SIMD exploration, including corrective operations, for a VLIW processor;278
12.5;9.5 Comparison to related work;282
12.6;9.6 Conclusions and key messages of this chapter;286
13;Chapter 10: Strength Reduction of Multipliers;288
13.1;10.1 Multiplier strength reduction: Motivation;289
13.2;10.2 Constant multiplications: A relevant sub-set;290
13.2.1;10.2.1 Types of multiplications;291
13.2.2;10.2.2 Motivating example;294
13.3;10.3 Systematic description of the global exploration/conversion space;297
13.3.1;10.3.1 Primitive conversion methods;298
13.3.1.1;10.3.1.1 Bitwise (or parallel) method;298
13.3.1.2;10.3.1.2 Recursive (or sequential) method;299
13.3.2;10.3.2 Partial conversion methods;300
13.3.2.1;10.3.2.1 Multiplicative factoring;301
13.3.2.2;10.3.2.2 Additive factoring (word splitting);302
13.3.3;10.3.3 Coding;303
13.3.4;10.3.4 Modifying the instruction-set;304
13.3.5;10.3.5 Optimization techniques;306
13.3.6;10.3.6 Implementation cost vs. operator accuracy trade-off;307
13.3.6.1;10.3.6.1 Trading off accuracy with performance;308
13.3.6.2;10.3.6.2 Preventing width expansion of multiplication results;309
13.3.7;10.3.7 Cost-aware search over conversion space;313
13.4;10.4 Experimental results;314
13.4.1;10.4.1 Experimental procedure;315
13.4.2;10.4.2 IDCT kernel (part of MPEG2 decoder);315
13.4.3;10.4.3 FFT kernel, including accuracy trade-offs;317
13.4.4;10.4.4 DWT kernel, part of architecture exploration;320
13.4.5;10.4.5 Online biotechnology monitoring application;323
13.4.6;10.4.6 Potential improvements of the strength reduction;324
13.4.6.1;10.4.6.1 Loop Buffer with Local Controller;324
13.4.6.2;10.4.6.2 Link between SSA, CSD and performance;324
13.4.6.3;10.4.6.3 Multiple precision MUL operations;325
13.5;10.5 Comparison to related work;325
13.6;10.6 Conclusions and key messages of this chapter;326
14;Chapter 11: Bioimaging ASIP benchmark study;328
14.1;11.1 Bioimaging application and quantisation;329
14.2;11.2 Effective constant multiplication realisation with shift and adds;335
14.3;11.3 Architecture exploration for scalar ASIP-VLIW options;344
14.3.1;11.3.1 Constant multiplication FU mapping: Specific SA and SAS options;351
14.3.2;11.3.2 FUs for the Generic SAs;352
14.3.3;11.3.3 Cost-effective mapping of detection algorithm;356
14.4;11.4 Data-path architecture exploration for data-parallel ASIP options;359
14.5;11.5 Background and foreground memory organisation for SoftSIMD ASIP;366
14.5.1;11.5.1 Basic proposal for 2D array access scheme;366
14.5.2;11.5.2 Overall schedule for SoftSIMD option;369
14.6;11.6 Energy results and discussion;371
14.6.1;11.6.1 Data path energy estimation for critical Gauss loop of scalar ASIP;372
14.6.2;11.6.2 Data path energy estimation for critical Gauss loops of SoftSIMD ASIP;374
14.6.3;11.6.3 Data path energy estimation for overall Detection algorithm;376
14.6.4;11.6.4 Energy modeling for SRAM and VWR contribution;378
14.6.5;11.6.5 Memory energy contributions;380
14.6.6;11.6.6 Global performance and energy results for options;381
14.7;11.7 Conclusions and key messages of this chapter;385
15;Chapter 12: Conclusions;386
15.1;12.1 Related work overview;386
15.2;12.2 Ultra low energy architecture exploration;387
15.3;12.3 Main energy-efficient platform components;388
16;Bibliography;391