E-book, English, 325 pages
Govindaraju / Setlur: Guide to OCR for Indic Scripts
1st edition 2009
ISBN: 978-1-84800-330-9
Publisher: Springer
Format: PDF
Copy protection: 1 - PDF Watermark
Document Recognition and Retrieval
Series: Advances in Computer Vision and Pattern Recognition
This is the first comprehensive text on Optical Character Recognition for Indic scripts. It covers many topics and describes OCR systems for eight different scripts: Bangla, Devanagari, Gurmukhi, Gujarati, Kannada, Malayalam, Tamil, and Urdu.
Further Information & Material
1;Foreword;4
2;Preface;6
2.1;1 Part I: Recognition of Indic Scripts;9
2.2;2 Part II: Retrieval of Indic Documents;11
2.3;3 Target Audience;11
3;Acknowledgments;13
4;Contents;14
5;Contributors;16
6;Part I Recognition of Indic Scripts;19
7;Building Data Sets for Indian Language OCR Research;20
7.1;1 Introduction;20
7.2;2 Datasets;21
7.2.1;2.1 Image Corpus;21
7.2.1.1;2.1.1 Digitization;22
7.2.1.2;2.1.2 Processing and Storage;22
7.2.2;2.2 Text Corpus;23
7.2.3;2.3 Annotated Data Sets;23
7.3;3 Annotation;24
7.3.1;3.1 Hierarchical Annotation;26
7.3.1.1;3.1.1 Different Levels of Annotation;26
7.3.1.2;3.1.2 Methods of Annotation;27
7.3.2;3.2 Annotation Process;28
7.3.2.1;3.2.1 Segmentation;28
7.3.2.2;3.2.2 Components Labeling;29
7.3.2.3;3.2.3 Annotation Tools;31
7.4;4 Representation and Access;32
7.4.1;4.1 Sources of Metainformation;33
7.4.2;4.2 Recognizer-Specific Metainformation;34
7.4.3;4.3 Digitization Meta Information;34
7.4.4;4.4 Annotation Data;35
7.4.4.1;4.4.1 Page Structure Information;36
7.4.4.2;4.4.2 Text Block Structure Information;36
7.4.4.3;4.4.3 Akshara Structure Information;37
7.4.5;4.5 Representation Issues;37
7.4.5.1;4.5.1 Complex Layout;37
7.4.5.2;4.5.2 Indian Language Script Issues;37
7.4.6;4.6 Data Access;38
7.5;5 Implementation and Execution;39
7.5.1;5.1 Organization of Tasks;39
7.5.2;5.2 Status of the Data Sets;40
7.6;6 Conclusions;40
7.7;References;41
8;On OCR of Major Indian Scripts: Bangla and Devanagari;43
8.1;1 Introduction;43
8.2;2 Basic OCR System;45
8.2.1;2.1 Group and Individual Character Classifiers;48
8.3;3 Quantification of Errors;50
8.4;4 Post-recognition Error Correction;52
8.4.1;4.1 Forward-Backward Error Correction Scheme;53
8.5;5 Discussion;57
8.6;References;57
9;A Complete Machine-Printed Gurmukhi OCR System;59
9.1;1 Introduction;59
9.2;2 Characteristics of Gurmukhi Script;60
9.2.1;2.1 Character Set;60
9.2.2;2.2 Connectivity of Symbols;60
9.2.3;2.3 Word Partitioning into Zones;61
9.2.4;2.4 Frequently Touching Characters;62
9.2.5;2.5 Broken Characters and Headlines;62
9.2.6;2.6 Similarity of Group of Symbols;62
9.3;3 System Overview;62
9.4;4 Digitization and Pre-processing;62
9.5;5 Splitting Text into Horizontal Text Strips;64
9.6;6 Word Segmentation;67
9.7;7 Sub-division of Strips into Smaller Units;68
9.8;8 Repairing the Word Shape;69
9.9;9 Thinning;70
9.10;10 Repairing Broken Characters;72
9.11;11 Character Segmentation;74
9.11.1;11.1 Touching Characters;77
9.12;12 Recognition Stage;78
9.12.1;12.1 Feature Extraction;78
9.12.2;12.2 Classification;80
9.12.2.1;12.2.1 Design of the Binary Tree Classifier;81
9.12.3;12.3 Merging Sub-symbols;81
9.13;13 Post-Processing;84
9.13.1;13.1 Check for the Existence of a Word in the Corpus;84
9.13.2;13.2 Perform Holistic Recognition of a Word;84
9.14;14 Experimental Results;85
9.15;15 Conclusion;86
9.16;References;87
10;Progress in Gujarati Document Processing and Character Recognition;88
10.1;1 Introduction;88
10.2;2 Gujarati Script: OCR Perspective;89
10.3;3 Segmentation;91
10.4;4 Zone Boundary Identification;92
10.4.1;4.1 Using Slopes of the Imaginary Lines Joining Top Left (Bottom Right) Corners;93
10.4.2;4.2 Dynamic Programming Approach;95
10.5;5 Extracting Recognizable Units;98
10.6;6 Recognition;98
10.6.1;6.1 Feature Extraction;99
10.6.1.1;6.1.1 Fringe Map;100
10.6.1.2;6.1.2 Discrete Cosine Transform;100
10.6.1.3;6.1.3 Wavelet Transform;101
10.6.1.4;6.1.4 Zone Information;102
10.6.1.5;6.1.5 Aspect Ratio;102
10.6.2;6.2 Classification;102
10.6.2.1;6.2.1 Nearest Neighbor Classifier;102
10.6.2.2;6.2.2 Artificial Neural Networks [25, 26];103
10.6.2.3;6.2.3 Multi-layer Perceptron (MLP) [25];103
10.6.2.4;6.2.4 Radial Basis Functions (RBF) networks;103
10.6.2.5;6.2.5 General Regression Neural Network (GRNN);104
10.6.3;6.3 Experimental Setup and Results;106
10.7;7 Text Generation;107
10.8;8 Post-processing;108
10.9;9 Conclusion;108
10.10;References;109
11;Design of a Bilingual Kannada-English OCR;111
11.1;1 Introduction;111
11.2;2 Kannada Script;112
11.3;3 Segmentation;112
11.3.1;3.1 Line Segmentation Based on Connected Components;114
11.3.2;3.2 Word and Character Segmentation;115
11.4;4 Script Recognition;115
11.4.1;4.1 Gabor and DCT-Based Identification;116
11.4.2;4.2 Results of Script Identification;117
11.5;5 Component Classification;119
11.5.1;5.1 Introduction;119
11.5.2;5.2 Graph Representations for Components;120
11.5.3;5.3 Distance Measures;122
11.5.4;5.4 Classification Strategy;123
11.5.5;5.5 Training;123
11.5.6;5.6 Prediction;124
11.5.7;5.7 Experiments, Results and Discussion;124
11.5.7.1;5.7.1 Data Sets;124
11.5.7.2;5.7.2 Features for SVM Classifiers;126
11.5.7.3;5.7.3 Pre-processing;128
11.5.7.4;5.7.4 Results and Discussions;128
11.6;6 Conclusion;137
11.7;References;137
12;Recognition of Malayalam Documents;139
12.1;1 Introduction;139
12.1.1;1.1 The Malayalam Language;140
12.1.1.1;1.1.1 Origin;140
12.1.1.2;1.1.2 Literary Culture;140
12.1.1.3;1.1.3 Word and Sentence Formation;141
12.1.2;1.2 The Malayalam Script;141
12.1.2.1;1.2.1 Script Revision;143
12.1.3;1.3 Evolution of Printing and Publication;144
12.1.4;1.4 Challenges in Malayalam Recognition;145
12.2;2 Character Recognition;146
12.2.1;2.1 Overview of the Approach;146
12.2.2;2.2 Design Guidelines;147
12.2.3;2.3 Features for Component Classification;148
12.2.4;2.4 Classifier Design;148
12.2.5;2.5 Beyond Recognition of Isolated Symbols;150
12.3;3 Recognition of Online Handwriting;151
12.3.1;3.1 Stroke Recognition;152
12.3.1.1;3.1.1 Dealing with Similar Strokes;153
12.3.2;3.2 Word Recognizer;154
12.4;4 Experimental Results;154
12.4.1;4.1 Overview of the Data Set;154
12.4.2;4.2 Classifier and Feature Comparisons;155
12.4.3;4.3 Recognition of Online Handwriting;157
12.5;5 Conclusions;158
12.6;References;159
13;A Complete OCR System for Tamil Magazine Documents;161
13.1;1 Introduction and Background;161
13.1.1;1.1 Preprocessing;162
13.1.1.1;1.1.1 Skew Estimation;163
13.1.1.2;1.1.2 Binarization;163
13.1.2;1.2 Page Segmentation and Classification;163
13.1.2.1;1.2.1 Page Segmentation;163
13.1.2.2;1.2.2 Block Classification;164
13.1.3;1.3 Optical Character Recognition (OCR);164
13.1.3.1;1.3.1 Character Segmentation;164
13.1.3.2;1.3.2 Character Recognition;165
13.1.4;1.4 Logical Structure;165
13.1.4.1;1.4.1 Document Models;166
13.2;2 Preprocessing;166
13.2.1;2.1 Image Size Reduction;166
13.2.2;2.2 Skew Correction;167
13.2.2.1;2.2.1 Text Recognition;167
13.2.2.2;2.2.2 Skew Estimation;168
13.2.3;2.3 Binarization;168
13.2.4;2.4 Noise Removal;168
13.3;3 Segmentation and Classification;168
13.3.1;3.1 Page Segmentation;169
13.3.2;3.2 Classification of the Blocks;169
13.4;4 Optical Character Recognition;170
13.4.1;4.1 Line, Word, and Character Segmentation;170
13.4.2;4.2 Recognition of Characters;171
13.5;5 Reconstruction of the Document Image;171
13.5.1;5.1 Logical Structure Derivation;171
13.5.2;5.2 Reconstruction into HTML Format;172
13.6;6 Results and Conclusions;172
13.6.1;6.1 Results;173
13.6.2;6.2 Conclusions;174
13.7;References;175
14;Experiments on Urdu Text Recognition;177
14.1;1 Introduction;177
14.2;2 Urdu Language Resources;180
14.3;3 Prior Work in Urdu Recognition Systems;181
14.4;4 Prior Work in Urdu Document Preprocessing;182
14.5;5 Experiments;183
14.6;References;184
15;The BBN Byblos Hindi OCR System;186
15.1;1 Introduction;186
15.1.1;1.1 Background;186
15.1.2;1.2 Review of Basic OCR System;187
15.1.3;1.3 Model Training and Recognition;188
15.2;2 Data;189
15.2.1;2.1 Hindi Character Set;189
15.2.2;2.2 Corpus;191
15.3;3 Experimental Results;191
15.3.1;3.1 Model Configuration;191
15.3.2;3.2 Recognition Performance;192
15.4;4 Conclusions;192
15.5;References;193
16;Generalization of Hindi OCR Using Adaptive Segmentation and Font Files;194
16.1;1 Introduction;194
16.1.1;1.1 Challenges of Segmentation;195
16.1.2;1.2 Feature Extraction and Classification;196
16.2;2 Base Devanagari OCR System;197
16.2.1;2.1 Background;197
16.2.2;2.2 System Design;198
16.2.3;2.3 Character Segmentation;200
16.2.3.1;2.3.1 Devanagari Script Overview;200
16.2.3.2;2.3.2 Hindi Character Segmentation;200
16.2.4;2.4 Feature Extraction;206
16.2.5;2.5 Classification;208
16.2.5.1;2.5.1 Template Matching;208
16.2.5.2;2.5.2 Generalized Hausdorff Image Comparison (GHIC);208
16.2.5.3;2.5.3 Nearest Neighbor Classifier and Weighted Euclidean Distance;209
16.2.5.4;2.5.4 Hierarchical Classification;209
16.2.6;2.6 Devanagari OCR Evaluation;210
16.2.7;2.7 Additional Challenges;210
16.3;3 Font-Based Intelligent Character Segmentation;212
16.3.1;3.1 Benefits and Font Models;212
16.3.2;3.2 Training Using Font Files;214
16.3.3;3.3 Segmentation and Recognition;214
16.4;4 Experiments;215
16.4.1;4.1 Data Sets;216
16.4.2;4.2 Protocols for Evaluation;217
16.4.3;4.3 Character Segmentation;217
16.4.4;4.4 Feature Extraction;217
16.4.5;4.5 Recognition Results;218
16.5;5 Conclusion and Future Work;218
16.6;References;219
17;Online Handwriting Recognition for Indic Scripts;221
17.1;1 Introduction;221
17.2;2 The Structure of Indic Scripts;222
17.3;3 Challenges for Online HWR;224
17.3.1;3.1 Large Alphabet Size;224
17.3.2;3.2 Two-Dimensional Structure;225
17.3.3;3.3 Inter-class Similarity;225
17.3.4;3.4 Issues with Writing Styles;226
17.3.5;3.5 Language-Specific and Regional Differences in Usage;227
17.4;4 Recognition of Isolated Characters;228
17.4.1;4.1 Strategies;229
17.4.2;4.2 Preprocessing;230
17.4.3;4.3 Features;230
17.4.4;4.4 Classification;231
17.5;5 Word Recognition;234
17.5.1;5.1 Preprocessing;235
17.5.2;5.2 Analytic Approaches Based on Explicit Segmentation;235
17.5.3;5.3 Analytic Approaches Based on Implicit Segmentation;236
17.5.4;5.4 Holistic Approaches;237
17.5.5;5.5 Language Models;238
17.6;6 Applications;238
17.7;7 Resources;240
17.7.1;7.1 Data Set Standards;241
17.7.2;7.2 Tools;241
17.7.3;7.3 Data Sets;242
17.8;8 Summary;242
17.9;References;243
18;Part II Retrieval of Indic Documents;247
19;Enhancing Access to Primary Cultural Heritage Materials of India;248
19.1;1 Introduction;248
19.2;2 Linguistic Tools;251
19.3;3 Image-Processing Tools;256
20;Digital Image Enhancement of Indic Historical Manuscripts;259
20.1;1 Introduction;259
20.2;2 Image Enhancement;261
20.2.1;2.1 Background Normalization;261
20.2.1.1;2.1.1 Background Normalization Using a Piece-Wise Linear Model;262
20.2.1.2;2.1.2 Background Normalization Using a Nonlinear Model;264
20.2.2;2.2 Image Normalization;266
20.2.3;2.3 Background Normalization for Color Images;267
20.2.4;2.4 Color Document Image Enhancement;268
20.3;3 Experiments;269
20.4;4 Extract Text Lines from Images;270
20.4.1;4.1 ALCM Method;272
20.4.1.1;4.1.1 ALCM Transform;272
20.4.1.2;4.1.2 Locations of Possible Text Lines;274
20.4.1.3;4.1.3 Extraction of Text;275
20.5;5 Conclusion;276
20.6;References;276
21;GFG-Based Compression and Retrieval of Document Images in Indian Scripts;278
21.1;1 Introduction;278
21.2;2 Geometric Feature Graph (GFG) of a Word Image;280
21.2.1;2.1 GFG Extraction;281
21.2.2;2.2 Converting the GFG to a String Representation;282
21.2.3;2.3 Reconstruction of Word Images Using GFG;283
21.2.4;2.4 GFG Compression;284
21.3;3 GFG-Based Indexing;285
21.4;4 Latent Semantic Indexing Using GFG;285
21.4.1;4.1 Results of Using LSA and PLSA;287
21.5;5 Ontology-Based Access with GFG;290
21.5.1;5.1 Concept-Driven Document Image Retrieval;290
21.5.2;5.2 Results;291
21.6;6 Conclusion;292
21.7;References;293
22;Word Spotting for Indic Documents to Facilitate Retrieval;294
22.1;1 Introduction;294
22.2;2 Related Work;296
22.3;3 Proposed Methodologies;297
22.3.1;3.1 Recognition-Based Keyword Spotting;297
22.3.1.1;3.1.1 Performance;302
22.3.2;3.2 Recognition-Free Keyword Spotting;303
22.3.2.1;3.2.1 Performance;307
22.4;4 Conclusion;307
22.5;References;308
23;Indian Language Information Retrieval;309
23.1;1 Introduction;309
23.1.1;1.1 Background;311
23.2;2 Overview of Indian Language IR;311
23.2.1;2.1 Information Sources;311
23.2.2;2.2 Research Efforts;312
23.2.2.1;2.2.1 Text Retrieval;313
23.2.2.2;2.2.2 Information Extraction;316
23.2.2.3;2.2.3 Question Answering;317
23.2.2.4;2.2.4 Topic Detection and Tracking;317
23.2.2.5;2.2.5 Indian Language Subtrack at CLEF 2007;318
23.3;3 The CLIA Project;319
23.3.1;3.1 The Forum for Information Retrieval Evaluation (FIRE);320
23.4;4 Conclusion;320
23.5;References;321
24;Colour Plates;323
25;Index;329




