E-Book, Englisch, 315 Seiten
Lysaght / Rosenstiel New Algorithms, Architectures and Applications for Reconfigurable Computing
1. Auflage 2005
ISBN: 978-1-4020-3128-1
Verlag: Springer US
Format: PDF
Kopierschutz: 1 - PDF Watermark
New Algorithms, Architectures and Applications for Reconfigurable Computing consists of a collection of contributions from the authors of some of the best papers from the Field Programmable Logic conference (FPL'03) and the Design, Automation and Test in Europe conference (DATE'03). In all, seventy-nine authors from research teams around the world were invited to present their latest research in the extended format permitted by this special volume. The result is a valuable book that is a unique record of the state of the art in research into field programmable logic and reconfigurable computing. The contributions are organized into twenty-four chapters and are grouped into three main categories: architectures, tools and applications. Within these three broad areas the most strongly represented themes are coarse-grained architectures; dynamically reconfigurable and multi-context architectures; tools for coarse-grained and reconfigurable architectures; and networking, security and encryption applications. Field programmable logic and reconfigurable computing are exciting research disciplines that span the traditional boundaries of electronic engineering and computer science. When the skills of both research communities are combined to address the challenges of a single research discipline, they serve as a catalyst for innovative research. The work reported in the chapters of this book captures the spirit of that innovation.
Autoren/Hrsg.
Weitere Infos & Material
1;Contents;5
2;Introduction;9
3;About the Editors;15
4;Acknowledgements;17
5;Architectures;18
5.1;1 Extra-dimensional Island-Style FPGAs Herman Schmit;20
5.1.1;1.1 Architecture;22
5.1.2;1.2 Experimental Evaluation;26
5.1.3;1.3 Time Multiplexing and Forward-compatibility;28
5.1.4;1.4 Conclusions;29
5.1.5;References;29
5.2;2 A Tightly Coupled VLIW/Reconfigurable Matrix and its Modulo Scheduling Technique ;32
5.2.1;2.1 Introduction;32
5.2.2;2.2 ADRES Architecture;33
5.2.2.1;2.2.1 Architecture Description;33
5.2.2.2;2.2.2 Improved Performance with the VLIW Processor;35
5.2.2.3;2.2.3 Simplified Programming Model and Reduced Communication Cost;36
5.2.2.4;2.2.4 Resource Sharing;36
5.2.4;2.3 Modulo Scheduling;37
5.2.4.1;2.3.1 Problem Illustrated;37
5.2.4.2;2.3.2 Modulo Routing Resource Graph;38
5.2.4.3;2.3.3 Modulo Scheduling Algorithm;40
5.2.5;2.4 Experimental Results;42
5.2.6;2.5 Conclusions and Future Work;44
5.2.7;References;44
5.3;3 Stream-based XPP Architectures in Adaptive System-on-Chip Integration;46
5.3.1;3.1 Introduction;46
5.3.2;3.2 Stream-based XPP Architecture;48
5.3.2.1;3.2.1 Array Concept and Datapath Structure;49
5.3.2.2;3.2.2 Stream Processing and Self-synchronization;49
5.3.2.3;3.2.3 Configuration Handling;50
5.3.3;3.3 Adaptive XPP-based System-on-Chip;50
5.3.4;3.4 XPP64A: First-Time-Right-Silicon;54
5.3.5;3.5 Application Evaluation—Examples;56
5.3.6;3.6 Conclusions;57
5.3.7;References;58
5.4;4 Core-Based Architecture for Data Transfer Control in SoC Design;60
5.4.1;4.1 Introduction;60
5.4.2;4.2 Digital Systems with Very Time Consuming Data Exchange Requirements. Design Alternatives;61
5.4.3;4.3 System on a Reprogrammable Chip Design Methodology;63
5.4.4;4.4 SoRC Core-Based Architecture;64
5.4.4.1;4.4.1 Communication Bus IP Cores;65
5.4.4.2;4.4.2 Data Transfer Bus IP Cores;65
5.4.4.3;4.4.3 Main Processor Bus IP Cores;68
5.4.5;4.5 Verification and Analysis User Interface;68
5.4.6;4.6 Results and Conclusions;69
5.4.7;References;70
5.5;5 Customizable and Reduced Hardware Motion Estimation Processors;72
5.5.1;5.1 Introduction;72
5.5.2;5.2 Base FSBM Architecture;74
5.5.3;5.3 Architectures for Limited Resources Devices;75
5.5.3.1;5.3.1 Decimation at the Pixel Level;76
5.5.3.2;5.3.2 Reduction of the Precision of the Pixel Values;78
5.5.4;5.4 Implementation and Experimental Results;78
5.5.5;5.5 Conclusion;82
5.5.6;References;83
6;Methodologies and Tools;84
6.1;6 Enabling Run-time Task Relocation on Reconfigurable Systems;86
6.1.1;6.1 Hardware/Software Multitasking on a Reconfigurable Computing Platform;87
6.1.2;6.2 Uniform Communication Scheme;89
6.1.3;6.3 Unified Design of Hardware and Software with OCAPI-xl;91
6.1.4;6.4 Heterogeneous Context Switch Issues;92
6.1.5;6.5 Relocatable Video Decoder;94
6.1.5.1;6.5.1 The T-ReCS Gecko Demonstrator;94
6.1.5.2;6.5.2 The Video Decoder;94
6.1.5.3;6.5.3 Results;95
6.1.6;6.6 Conclusions;96
6.1.7;References;96
6.2;7 A Unified Codesign Environment;98
6.2.1;7.1 Related Work;99
6.2.2;7.2 System Architecture;100
6.2.2.1;7.2.1 Task Model;101
6.2.2.2;7.2.2 Task Manager Program;102
6.2.3;7.3 Codesign Environment;103
6.2.4;7.4 Implementation in the UltraSONIC Platform;105
6.2.5;7.5 A Case Study of FFT Algorithm;106
6.2.6;7.6 Conclusions;107
6.2.7;References;108
6.3;8 Mapping Applications to a Coarse Grain Reconfigurable System;110
6.3.1;8.1 Introduction;110
6.3.2;8.2 The Target Architecture: MONTIUM;111
6.3.3;8.3 A Four-Phase Decomposition;112
6.3.4;8.4 Translating C to a CDFG;113
6.3.5;8.5 Clustering;114
6.3.6;8.6 Scheduling;115
6.3.7;8.7 Allocation;117
6.3.8;8.8 Conclusion;119
6.3.9;8.9 Related work;119
6.3.10;References;120
6.4;9 Compilation and Temporal Partitioning for a Coarse-grain Reconfigurable Architecture;122
6.4.1;9.1 Introduction;122
6.4.2;9.2 The XPP Architecture and the Configure-Execute Paradigm;123
6.4.3;9.3 Compilation;125
6.4.4;9.4 Experimental Results;128
6.4.5;9.5 Related Work;130
6.4.6;9.6 Conclusions;130
6.4.7;References;131
6.5;10 Run-time Defragmentation for Dynamically Reconfigurable Hardware;134
6.5.1;Introduction;135
6.5.2;Dynamic Relocation;138
6.5.3;Rearranging Routing Resources;143
6.5.4;Conclusion;145
6.5.5;References;145
6.6;11 Virtual Hardware Byte Code as a Design Platform for Reconfigurable Embedded Systems;148
6.6.1;11.1 Introduction;148
6.6.1.1;11.1.1 State of the Art;150
6.6.1.2;11.1.2 Our Approach;152
6.6.2;11.2 The Virtual Hardware Byte Code;152
6.6.3;11.3 The Byte Code Compiler;154
6.6.4;11.4 The Virtual Hardware Machine;155
6.6.5;11.5 Results;157
6.6.6;11.6 Conclusions and Future Work;159
6.6.7;References;159
6.7;12 A Low Energy Data Management for Multi-Context Reconfigurable Architectures;162
6.7.1;12.1 Introduction;162
6.7.2;12.2 Architecture and Framework Overview;164
6.7.3;12.3 Problem Overview;165
6.7.4;12.4 Low Energy RC-RAM Management;166
6.7.5;12.5 Low Energy FB Management;168
6.7.6;12.6 Low Energy CM Management;169
6.7.7;12.7 Experimental Results;170
6.7.8;12.8 Conclusions;171
6.7.9;References;172
6.8;13 Dynamic and Partial Reconfiguration in FPGA SoCs: Requirements Tools and a Case Study;174
6.8.1;13.1 Introduction;174
6.8.2;13.2 Requirements for FPGA SoC DRSs;176
6.8.3;13.3 Tools for DRS;177
6.8.4;13.4 A DRS Case Study: Design and Experimental Results;179
6.8.5;13.5 Conclusions;183
6.8.6;References;184
7;Applications;186
7.1;14 Design Flow for a Reconfigurable Processor Implementation of a Turbo-decoder;188
7.1.1;14.1 Introduction;188
7.1.2;14.2 Related Work;190
7.1.3;14.3 Design Flow for the Reconfigurable Processor;191
7.1.4;14.4 Design Tools for the Reconfigurable Processor;194
7.1.5;14.5 Case Study: Turbo Decoding;196
7.1.6;14.6 Conclusions;198
7.1.7;References;198
7.2;15 IPsec-Protected Transport of HDTV over IP;200
7.2.1;15.1 Introduction;200
7.2.2;15.2 GRIP System Architecture;201
7.2.3;15.3 GRIP Hardware;203
7.2.3.1;15.3.1 Basic platform;203
7.2.3.2;15.3.2 X1/X2 IPsec Accelerator Cores;204
7.2.4;15.4 Integrating GRIP with the Operating System;204
7.2.5;15.5 Example Application: Encrypted Transport of HDTV over IP;206
7.2.5.1;15.5.1 Background;206
7.2.5.2;15.5.2 Design and Implementation;206
7.2.6;15.6 Related Work;207
7.2.7;15.7 Results;208
7.2.7.1;15.7.1 System Performance;208
7.2.7.2;15.7.2 Evaluating Hardware Implementations;209
7.2.8;15.8 Conclusions and Future Work;209
7.2.9;References;211
7.3;16 Fast, Large-scale String Match for a 10 Gbps FPGA-based NIDS;212
7.3.1;16.1 Introduction;212
7.3.2;16.2 Architecture of Pattern Matching Subsystem;214
7.3.2.1;16.2.1 Pipelined Comparator;215
7.3.2.2;16.2.2 Pipelined Encoder;216
7.3.2.3;16.2.3 Packet Data Fan-out;216
7.3.2.4;16.2.4 VHDL Generator;217
7.3.3;16.3 Evaluation Results;217
7.3.3.1;16.3.1 Performance;217
7.3.3.2;16.3.2 Cost: Area and Latency;219
7.3.4;16.4 Comparison with Previous Work;220
7.3.5;16.5 Conclusions and Future Work;221
7.3.6;References;224
7.4;17 Architecture and FPGA Implementation of a Digit-serial RSA Processor Alessandro Cilardo, Antonino Mazzeo, Luigi Romano, Giacinto Paolo Saggese;226
7.4.1;17.1 Algorithm Used for the RSA Processor;228
7.4.2;17.2 Architecture of the RSA Processor;229
7.4.3;17.3 FPGA Implementation and Performance Analysis;232
7.4.4;17.4 Related Work;234
7.4.5;17.5 Conclusions;234
7.4.6;References;235
7.5;18 Division in GF(p) for Application in Elliptic Curve Cryptosystems on Field Programmable Logic;236
7.5.1;18.1 Introduction;236
7.5.2;18.2 Elliptic Curve Cryptography over GF(p);237
7.5.3;18.3 Modular Inversion;239
7.5.4;18.4 Modular Division;239
7.5.5;18.5 Basic Division Architecture;240
7.5.6;18.6 Proposed Carry-Select Division Architecture;241
7.5.7;18.7 Results;243
7.5.8;18.8 Conclusions;245
7.5.9;References;245
7.6;19 A New Arithmetic Unit in GF(2M) for Reconfigurable Hardware Implementation;248
7.6.1;19.1 Introduction;248
7.6.2;19.2 Mathematical Background;250
7.6.2.1;19.2.1 GF(2m) Field Arithmetic for ECC;250
7.6.2.2;19.2.2 GF(2m) Field Arithmetic for ECC;251
7.6.3;19.3 A New Dependence Graph for Both Division and Multiplication in GF(2m);251
7.6.3.1;19.3.1 Dependence Graph for Division in GF(2m);251
7.6.3.2;19.3.2 DG for MSB-first Multiplication in GF(2m);256
7.6.3.3;19.3.3 A New DG for Both Division and Multiplication in GF(2m);258
7.6.4;19.4 A New AU for Both Division and Multiplication in GF(2m);260
7.6.5;19.5 Results and Conclusions;263
7.6.6;References;265
7.7;20 Performance Analysis of SHACAL-1 Encryption Hardware Architectures Maire McLoone, J.V. McCanny;268
7.7.1;20.1 Introduction;268
7.7.2;20.2 A Description of the SHACAL-1 Algorithm;269
7.7.2.1;20.2.1 SHACAL-1 Decryption;271
7.7.3;20.3 SHACAL-1 Hardware Architectures;272
7.7.3.1;20.3.1 Iterative SHACAL-1 Architectures;272
7.7.3.2;20.3.2 Fully and Sub-Pipelined SHACAL-1 Architectures;276
7.7.4;20.4 Performance Evaluation;278
7.7.5;20.5 Conclusions;279
7.7.6;References;280
7.8;21 Security Aspects of FPGAs in Cryptographic Applications ;282
7.8.1;21.1 Introduction and Motivation;282
7.8.2;21.2 Shortcomings of FPGAs for Cryptographic Applications;283
7.8.2.1;21.2.1 Why Does Someone Want to Attack FPGAs?;283
7.8.2.2;21.2.2 Description of the Black Box Attack;284
7.8.2.3;21.2.3 Cloning of SRAM FPGAs;284
7.8.2.4;21.2.4 Description of the Readback Attack;284
7.8.2.5;21.2.5 Reverse-Engineering of the Bitstreams;285
7.8.2.6;21.2.6 Description of Side Channel Attacks;286
7.8.2.7;21.2.7 Description of Physical Attacks;286
7.8.3;21.3 Prevention of Attacks;290
7.8.3.1;21.3.1 How to Prevent Black Box Attacks;291
7.8.3.2;21.3.2 How to Prevent Cloning of SRAM FPGAs;291
7.8.3.3;21.3.3 How to Prevent Readback Attacks;292
7.8.3.4;21.3.4 How to Prevent Side Channel Attack;292
7.8.3.5;21.3.5 How to Prevent Physical Attacks;293
7.8.4;21.4 Conclusions;293
7.8.5;References;294
7.9;22 Bioinspired Stimulus Encoder for Cortical Visual Neuroprostheses ;296
7.9.1;22.1 Introduction;296
7.9.2;22.2 Model Architecture;298
7.9.2.1;22.2.1 Retina Early Layers;298
7.9.2.2;22.2.2 Neuromorphic Pulse Coding;300
7.9.3;22.3 FPL Implementation;301
7.9.3.1;22.3.1 The Retina Early Layers;301
7.9.3.2;22.3.2 Neuromorphic Pulse Coding;303
7.9.4;22.4 Experimental Results;304
7.9.5;22.5 Conclusions;306
7.9.6;References;307
7.10;23 A Smith-Waterman Systolic Cell ;308
7.10.1;23.1 Introduction;308
7.10.2;23.2 The Smith-Waterman Algorithm;310
7.10.3;23.3 FPGA Implementation;312
7.10.4;23.4 Results;315
7.10.5;23.5 Conclusion;317
7.10.6;References;317
7.11;24 The Effects of Polynomial Degrees;318
7.11.1;24.1 Background;320
7.11.2;24.2 The Hierarchical Segmentation Method;321
7.11.3;24.3 The Effects of Polynomial Degrees;323
7.11.4;24.4 Evaluation and Results;327
7.11.5;24.5 Conclusion;329
7.11.6;References;330
11.3 The Byte Code Compiler (pp. 137-138)
The Byte Code Compiler is a central component of the VHBC approach: it compiles working hardware designs, coded as VHDL descriptions, into a portable and efficient VHBC representation, removing the need to redesign existing hardware projects. The tool flow within the VHDL compiler divides into three main stages: hardware synthesis, netlist-to-byte-code conversion, and byte code optimization and scheduling.
In the first stage the VHDL description is compiled into a netlist of standard components, and standard logic optimization is performed on it, yielding an optimized netlist. The design of the compiler chain can be streamlined by using off-the-shelf hardware synthesis tools; current implementations of the VHDL compiler use, for example, the FPGA Express tool from Synopsys. These tools produce the anticipated output using a fairly standardized component library (in the case of FPGA Express, the SimPrim library from Xilinx). The resulting output of the first stage is converted to structural VHDL and passed on to the second stage. Most standard industry VHDL compilers with support for FPGA design readily provide the functionality needed for this step and can therefore be used.
In the second stage the components of the netlist are substituted by VHBC fragments to form a VHBC instruction stream. Before the components are mapped to a VHBC representation, however, the netlist is analyzed and optimized for VHBC. This optimization is necessary because commercial compilers targeting FPGAs usually output designs containing large numbers of buffers, inserted to preserve signal integrity that would otherwise be impaired by signal routing. Furthermore, such compilers tend to employ logic representations based on NAND or NOR gates, which are more efficient when cast into silicon.
The resulting logic structure is, however, more complex and exhibits more logic levels. The code fragments used to substitute the logic components are based on predefined, general VHBC implementations of those components and are adjusted according to the data flow found in the structural description from the first phase: registers are allocated and the instructions are sequenced according to the inherent data dependencies.
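The buffer-stripping step mentioned above can be sketched as a simple netlist pass. The following is a hypothetical illustration only: the (gate_type, output_net, input_nets) tuple format and the gate names are assumptions for the sketch, not the real VHBC tool's data structures.

```python
def strip_buffers(netlist):
    """Remove BUF gates by rewiring each buffer's output net to its input net.

    netlist: list of (gate_type, output_net, input_nets) tuples.
    Returns a new netlist without buffer gates.
    """
    # Map each buffer's output net to its input net.
    alias = {out: ins[0] for gate, out, ins in netlist if gate == "BUF"}

    def resolve(net):
        # Follow chains of buffers back to the original driver.
        while net in alias:
            net = alias[net]
        return net

    return [(gate, out, [resolve(n) for n in ins])
            for gate, out, ins in netlist if gate != "BUF"]

example = [
    ("AND", "n1", ["a", "b"]),
    ("BUF", "n2", ["n1"]),
    ("BUF", "n3", ["n2"]),      # chained buffer
    ("XOR", "y",  ["n3", "c"]),
]
print(strip_buffers(example))
# -> [('AND', 'n1', ['a', 'b']), ('XOR', 'y', ['n1', 'c'])]
```

After the pass, consumers of the buffered nets read directly from the original driving gate, which is what makes the subsequent fragment substitution cheaper.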
In the third stage the byte code sequence is optimized and scheduled into blocks of independent instructions. First the data flow graph of the entire design is constructed, which is possible because the code contains no control flow instructions such as jumps. The code fragments introduced in the second stage are very general, so the resulting code leaves considerable room for optimization. One such technique is dead code elimination, which removes unnecessary instructions. The code is further optimized by applying predefined code substitution rules along the data paths, such as XOR extraction or double-negation removal, to reduce the number of instructions and compact the code.
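Two of the optimizations named here, double-negation removal and dead code elimination, can be sketched on a jump-free instruction list. The (op, dest, srcs) format is an assumption for illustration, not the actual VHBC encoding; because there is no control flow, a single backward pass suffices for liveness.

```python
def remove_double_negation(code):
    """Rewrite uses of NOT(NOT(x)) so consumers read x directly."""
    defs = {dest: (op, srcs) for op, dest, srcs in code}
    alias = {}
    for op, dest, srcs in code:
        if op == "NOT":
            inner = defs.get(srcs[0])
            if inner and inner[0] == "NOT":
                alias[dest] = inner[1][0]   # outer NOT == original value
    return [(op, dest, [alias.get(s, s) for s in srcs])
            for op, dest, srcs in code]

def eliminate_dead_code(code, outputs):
    """Keep only instructions whose result transitively feeds an output."""
    live = set(outputs)
    kept = []
    for op, dest, srcs in reversed(code):   # jump-free, so one pass suffices
        if dest in live:
            live.update(srcs)
            kept.append((op, dest, srcs))
    return list(reversed(kept))

code = [
    ("NOT", "t1", ["a"]),
    ("NOT", "t2", ["t1"]),   # double negation: t2 == a
    ("AND", "y",  ["t2", "b"]),
]
code = remove_double_negation(code)
code = eliminate_dead_code(code, outputs=["y"])
print(code)
# -> [('AND', 'y', ['a', 'b'])]
```

The substitution pass only rewires operands; the now-unreferenced NOT instructions become dead and are swept away by the elimination pass, which is why the two passes are run in this order.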
The optimized code is then scheduled using a list-based scheduling scheme [14]. The objective of the scheduling is to group the instructions into code blocks such that the number of code blocks is minimal and the instructions are evenly distributed among the blocks. Furthermore, the time that data remains unused, i.e. the number of clock cycles between the calculation of a datum and its use in another operation, should be minimal. The scheduled code is then converted to the VHBC image format, and the compiler flow concludes.
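A minimal sketch of the block-building idea: each block collects instructions whose source operands were all produced in earlier blocks or are primary inputs. This greedy as-soon-as-possible variant illustrates only the grouping by data dependencies; the balancing and data-lifetime objectives described above would require a more elaborate priority function. The (op, dest, srcs) instruction format is assumed for illustration.

```python
def schedule_into_blocks(code, inputs):
    """Greedy list scheduling of a jump-free instruction list.

    Each block holds mutually independent instructions whose operands
    are all available (primary inputs or results of earlier blocks).
    """
    ready_nets = set(inputs)
    pending = list(code)
    blocks = []
    while pending:
        block = [ins for ins in pending
                 if all(s in ready_nets for s in ins[2])]
        if not block:
            raise ValueError("cyclic or undriven dependency in code")
        blocks.append(block)
        ready_nets.update(ins[1] for ins in block)  # results now available
        pending = [ins for ins in pending if ins not in block]
    return blocks

code = [
    ("AND", "t1", ["a", "b"]),
    ("OR",  "t2", ["c", "d"]),
    ("XOR", "y",  ["t1", "t2"]),
]
for i, blk in enumerate(schedule_into_blocks(code, inputs=["a", "b", "c", "d"])):
    print(i, [dest for _, dest, _ in blk])
# -> 0 ['t1', 't2']
#    1 ['y']
```

The AND and OR are independent and share block 0; the XOR must wait for both results, so the schedule needs two blocks, which is the minimum for this data flow graph.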




