Govindaraju / Setlur | Guide to OCR for Indic Scripts | E-Book | www2.sack.de

E-book, English, 325 pages

Series: Advances in Computer Vision and Pattern Recognition

Govindaraju / Setlur Guide to OCR for Indic Scripts

Document Recognition and Retrieval
1st edition, 2009
ISBN: 978-1-84800-330-9
Publisher: Springer
Format: PDF
Copy protection: 1 - PDF Watermark




This is the first comprehensive text on Optical Character Recognition (OCR) for Indic scripts. It covers both the recognition and the retrieval of Indic documents, and describes OCR systems for eight scripts: Bangla, Devanagari, Gurmukhi, Gujarati, Kannada, Malayalam, Tamil and Urdu.


Further Information & Material


1;Foreword;4
2;Preface;6
2.1;1 Part I: Recognition of Indic Scripts;9
2.2;2 Part II: Retrieval of Indic Documents;11
2.3;3 Target Audience;11
3;Acknowledgments;13
4;Contents;14
5;Contributors;16
6;Part I Recognition of Indic Scripts;19
7;Building Data Sets for Indian Language OCR Research;20
7.1;1 Introduction;20
7.2;2 Datasets;21
7.2.1;2.1 Image Corpus;21
7.2.1.1;2.1.1 Digitization;22
7.2.1.2;2.1.2 Processing and Storage;22
7.2.2;2.2 Text Corpus;23
7.2.3;2.3 Annotated Data Sets;23
7.3;3 Annotation;24
7.3.1;3.1 Hierarchical Annotation;26
7.3.1.1;3.1.1 Different Levels of Annotation;26
7.3.1.2;3.1.2 Methods of Annotation;27
7.3.2;3.2 Annotation Process;28
7.3.2.1;3.2.1 Segmentation;28
7.3.2.2;3.2.2 Components Labeling;29
7.3.2.3;3.2.3 Annotation Tools;31
7.4;4 Representation and Access;32
7.4.1;4.1 Sources of Metainformation;33
7.4.2;4.2 Recognizer-Specific Metainformation;34
7.4.3;4.3 Digitization Meta Information;34
7.4.4;4.4 Annotation Data;35
7.4.4.1;4.4.1 Page Structure Information;36
7.4.4.2;4.4.2 Text Block Structure Information;36
7.4.4.3;4.4.3 Akshara Structure Information;37
7.4.5;4.5 Representation Issues;37
7.4.5.1;4.5.1 Complex Layout;37
7.4.5.2;4.5.2 Indian Language Script Issues;37
7.4.6;4.6 Data Access;38
7.5;5 Implementation and Execution;39
7.5.1;5.1 Organization of Tasks;39
7.5.2;5.2 Status of the Data Sets;40
7.6;6 Conclusions;40
7.7;References;41
8;On OCR of Major Indian Scripts: Bangla and Devanagari;43
8.1;1 Introduction;43
8.2;2 Basic OCR System;45
8.2.1;2.1 Group and Individual Character Classifiers;48
8.3;3 Quantification of Errors;50
8.4;4 Post-recognition Error Correction;52
8.4.1;4.1 Forward-Backward Error Correction Scheme;53
8.5;5 Discussion;57
8.6;References;57
9;A Complete Machine-Printed Gurmukhi OCR System;59
9.1;1 Introduction;59
9.2;2 Characteristics of Gurmukhi Script;60
9.2.1;2.1 Character Set;60
9.2.2;2.2 Connectivity of Symbols;60
9.2.3;2.3 Word Partitioning into Zones;61
9.2.4;2.4 Frequently Touching Characters;62
9.2.5;2.5 Broken Characters and Headlines;62
9.2.6;2.6 Similarity of Group of Symbols;62
9.3;3 System Overview;62
9.4;4 Digitization and Pre-processing;62
9.5;5 Splitting Text into Horizontal Text Strips;64
9.6;6 Word Segmentation;67
9.7;7 Sub-division of Strips into Smaller Units;68
9.8;8 Repairing the Word Shape;69
9.9;9 Thinning;70
9.10;10 Repairing Broken Characters;72
9.11;11 Character Segmentation;74
9.11.1;11.1 Touching Characters;77
9.12;12 Recognition Stage;78
9.12.1;12.1 Feature Extraction;78
9.12.2;12.2 Classification;80
9.12.2.1;12.2.1 Design of the Binary Tree Classifier;81
9.12.3;12.3 Merging Sub-symbols;81
9.13;13 Post-Processing;84
9.13.1;13.1 Check for the Existence of a Word in the Corpus;84
9.13.2;13.2 Perform Holistic Recognition of a Word;84
9.14;14 Experimental Results;85
9.15;15 Conclusion;86
9.16;References;87
10;Progress in Gujarati Document Processing and Character Recognition;88
10.1;1 Introduction;88
10.2;2 Gujarati Script: OCR Perspective;89
10.3;3 Segmentation;91
10.4;4 Zone Boundary Identification;92
10.4.1;4.1 Using Slopes of the Imaginary Lines Joining Top Left (Bottom Right) Corners;93
10.4.2;4.2 Dynamic Programming Approach;95
10.5;5 Extracting Recognizable Units;98
10.6;6 Recognition;98
10.6.1;6.1 Feature Extraction;99
10.6.1.1;6.1.1 Fringe Map;100
10.6.1.2;6.1.2 Discrete Cosine Transform;100
10.6.1.3;6.1.3 Wavelet Transform;101
10.6.1.4;6.1.4 Zone Information;102
10.6.1.5;6.1.5 Aspect Ratio;102
10.6.2;6.2 Classification;102
10.6.2.1;6.2.1 Nearest Neighbor Classifier;102
10.6.2.2;6.2.2 Artificial Neural Networks [25, 26];103
10.6.2.3;6.2.3 Multi-layer Perceptron (MLP) [25];103
10.6.2.4;6.2.4 Radial Basis Functions (RBF) networks;103
10.6.2.5;6.2.5 General Regression Neural Network (GRNN);104
10.6.3;6.3 Experimental Setup and Results;106
10.7;7 Text Generation;107
10.8;8 Post-processing;108
10.9;9 Conclusion;108
10.10;References;109
11;Design of a Bilingual Kannada-English OCR;111
11.1;1 Introduction;111
11.2;2 Kannada Script;112
11.3;3 Segmentation;112
11.3.1;3.1 Line Segmentation Based on Connected Components;114
11.3.2;3.2 Word and Character Segmentation;115
11.4;4 Script Recognition;115
11.4.1;4.1 Gabor and DCT-Based Identification;116
11.4.2;4.2 Results of Script Identification;117
11.5;5 Component Classification;119
11.5.1;5.1 Introduction;119
11.5.2;5.2 Graph Representations for Components;120
11.5.3;5.3 Distance Measures;122
11.5.4;5.4 Classification Strategy;123
11.5.5;5.5 Training;123
11.5.6;5.6 Prediction;124
11.5.7;5.7 Experiments, Results and Discussion;124
11.5.7.1;5.7.1 Data Sets;124
11.5.7.2;5.7.2 Features for SVM Classifiers;126
11.5.7.3;5.7.3 Pre-processing;128
11.5.7.4;5.7.4 Results and Discussions;128
11.6;6 Conclusion;137
11.7;References;137
12;Recognition of Malayalam Documents;139
12.1;1 Introduction;139
12.1.1;1.1 The Malayalam Language;140
12.1.1.1;1.1.1 Origin;140
12.1.1.2;1.1.2 Literary Culture;140
12.1.1.3;1.1.3 Word and Sentence Formation;141
12.1.2;1.2 The Malayalam Script;141
12.1.2.1;1.2.1 Script Revision;143
12.1.3;1.3 Evolution of Printing and Publication;144
12.1.4;1.4 Challenges in Malayalam Recognition;145
12.2;2 Character Recognition;146
12.2.1;2.1 Overview of the Approach;146
12.2.2;2.2 Design Guidelines;147
12.2.3;2.3 Features for Component Classification;148
12.2.4;2.4 Classifier Design;148
12.2.5;2.5 Beyond Recognition of Isolated Symbols;150
12.3;3 Recognition of Online Handwriting;151
12.3.1;3.1 Stroke Recognition;152
12.3.1.1;3.1.1 Dealing with Similar Strokes;153
12.3.2;3.2 Word Recognizer;154
12.4;4 Experimental Results;154
12.4.1;4.1 Overview of the Data Set;154
12.4.2;4.2 Classifier and Feature Comparisons;155
12.4.3;4.3 Recognition of Online Handwriting;157
12.5;5 Conclusions;158
12.6;References;159
13;A Complete OCR System for Tamil Magazine Documents;161
13.1;1 Introduction and Background;161
13.1.1;1.1 Preprocessing;162
13.1.1.1;1.1.1 Skew Estimation;163
13.1.1.2;1.1.2 Binarization;163
13.1.2;1.2 Page Segmentation and Classification;163
13.1.2.1;1.2.1 Page Segmentation;163
13.1.2.2;1.2.2 Block Classification;164
13.1.3;1.3 Optical Character Recognition (OCR);164
13.1.3.1;1.3.1 Character Segmentation;164
13.1.3.2;1.3.2 Character Recognition;165
13.1.4;1.4 Logical Structure;165
13.1.4.1;1.4.1 Document Models;166
13.2;2 Preprocessing;166
13.2.1;2.1 Image Size Reduction;166
13.2.2;2.2 Skew Correction;167
13.2.2.1;2.2.1 Text Recognition;167
13.2.2.2;2.2.2 Skew Estimation;168
13.2.3;2.3 Binarization;168
13.2.4;2.4 Noise Removal;168
13.3;3 Segmentation and Classification;168
13.3.1;3.1 Page Segmentation;169
13.3.2;3.2 Classification of the Blocks;169
13.4;4 Optical Character Recognition;170
13.4.1;4.1 Line, Word, and Character Segmentation;170
13.4.2;4.2 Recognition of Characters;171
13.5;5 Reconstruction of the Document Image;171
13.5.1;5.1 Logical Structure Derivation;171
13.5.2;5.2 Reconstruction into HTML Format;172
13.6;6 Results and Conclusions;172
13.6.1;6.1 Results;173
13.6.2;6.2 Conclusions;174
13.7;References;175
14;Experiments on Urdu Text Recognition;177
14.1;1 Introduction;177
14.2;2 Urdu Language Resources;180
14.3;3 Prior Work in Urdu Recognition Systems;181
14.4;4 Prior Work in Urdu Document Preprocessing;182
14.5;5 Experiments;183
14.6;References;184
15;The BBN Byblos Hindi OCR System;186
15.1;1 Introduction;186
15.1.1;1.1 Background;186
15.1.2;1.2 Review of Basic OCR System;187
15.1.3;1.3 Model Training and Recognition;188
15.2;2 DATA;189
15.2.1;2.1 Hindi Character Set;189
15.2.2;2.2 Corpus;191
15.3;3 Experimental Results;191
15.3.1;3.1 Model Configuration;191
15.3.2;3.2 Recognition Performance;192
15.4;4 Conclusions;192
15.5;References;193
16;Generalization of Hindi OCR Using Adaptive Segmentation and Font Files;194
16.1;1 Introduction;194
16.1.1;1.1 Challenges of Segmentation;195
16.1.2;1.2 Feature Extraction and Classification;196
16.2;2 Base Devanagari OCR System;197
16.2.1;2.1 Background;197
16.2.2;2.2 System Design;198
16.2.3;2.3 Character Segmentation;200
16.2.3.1;2.3.1 Devanagari Script Overview;200
16.2.3.2;2.3.2 Hindi Character Segmentation;200
16.2.4;2.4 Feature Extraction;206
16.2.5;2.5 Classification;208
16.2.5.1;2.5.1 Template Matching;208
16.2.5.2;2.5.2 Generalized Hausdorff Image Comparison (GHIC);208
16.2.5.3;2.5.3 Nearest Neighbor Classifier and Weighted Euclidean Distance;209
16.2.5.4;2.5.4 Hierarchical Classification;209
16.2.6;2.6 Devanagari OCR Evaluation;210
16.2.7;2.7 Additional Challenges;210
16.3;3 Font-Based Intelligent Character Segmentation;212
16.3.1;3.1 Benefits and Font Models;212
16.3.2;3.2 Training Using Font Files;214
16.3.3;3.3 Segmentation and Recognition;214
16.4;4 Experiments;215
16.4.1;4.1 Data Sets;216
16.4.2;4.2 Protocols for Evaluation;217
16.4.3;4.3 Character Segmentation;217
16.4.4;4.4 Feature Extraction;217
16.4.5;4.5 Recognition Results;218
16.5;5 Conclusion and Future Work;218
16.6;References;219
17;Online Handwriting Recognition for Indic Scripts;221
17.1;1 Introduction;221
17.2;2 The Structure of Indic Scripts;222
17.3;3 Challenges for Online HWR;224
17.3.1;3.1 Large Alphabet Size;224
17.3.2;3.2 Two-Dimensional Structure;225
17.3.3;3.3 Inter-class Similarity;225
17.3.4;3.4 Issues with Writing Styles;226
17.3.5;3.5 Language-Specific and Regional Differences in Usage;227
17.4;4 Recognition of Isolated Characters;228
17.4.1;4.1 Strategies;229
17.4.2;4.2 Preprocessing;230
17.4.3;4.3 Features;230
17.4.4;4.4 Classification;231
17.5;5 Word Recognition;234
17.5.1;5.1 Preprocessing;235
17.5.2;5.2 Analytic Approaches Based on Explicit Segmentation;235
17.5.3;5.3 Analytic Approaches Based on Implicit Segmentation;236
17.5.4;5.4 Holistic Approaches;237
17.5.5;5.5 Language Models;238
17.6;6 Applications;238
17.7;7 Resources;240
17.7.1;7.1 Data Set Standards;241
17.7.2;7.2 Tools;241
17.7.3;7.3 Data Sets;242
17.8;8 Summary;242
17.9;References;243
18;Part II Retrieval of Indic Documents;247
19;Enhancing Access to Primary Cultural Heritage Materials of India;248
19.1;1 Introduction;248
19.2;2 Linguistic Tools;251
19.3;3 Image-Processing Tools;256
20;Digital Image Enhancement of Indic Historical Manuscripts;259
20.1;1 Introduction;259
20.2;2 Image Enhancement;261
20.2.1;2.1 Background Normalization;261
20.2.1.1;2.1.1 Background Normalization Using a Piece-Wise Linear Model;262
20.2.1.2;2.1.2 Background Normalization Using a Nonlinear Model;264
20.2.2;2.2 Image Normalization;266
20.2.3;2.3 Background Normalization for Color Images;267
20.2.4;2.4 Color Document Image Enhancement;268
20.3;3 Experiments;269
20.4;4 Extract Text Lines from Images;270
20.4.1;4.1 ALCM Method;272
20.4.1.1;4.1.1 ALCM Transform;272
20.4.1.2;4.1.2 Locations of Possible Text Lines;274
20.4.1.3;4.1.3 Extraction of Text;275
20.5;5 Conclusion;276
20.6;References;276
21;GFG-Based Compression and Retrieval of Document Images in Indian Scripts;278
21.1;1 Introduction;278
21.2;2 Geometric Feature Graph (GFG) of a Word Image;280
21.2.1;2.1 GFG Extraction;281
21.2.2;2.2 Converting the GFG to a String Representation;282
21.2.3;2.3 Reconstruction of Word Images Using GFG;283
21.2.4;2.4 GFG Compression;284
21.3;3 GFG-Based Indexing;285
21.4;4 Latent Semantic Indexing Using GFG;285
21.4.1;4.1 Results of Using LSA and PLSA;287
21.5;5 Ontology-Based Access with GFG;290
21.5.1;5.1 Concept-Driven Document Image Retrieval;290
21.5.2;5.2 Results;291
21.6;6 Conclusion;292
21.7;References;293
22;Word Spotting for Indic Documents to Facilitate Retrieval;294
22.1;1 Introduction;294
22.2;2 Related Work;296
22.3;3 Proposed Methodologies;297
22.3.1;3.1 Recognition-Based Keyword Spotting;297
22.3.1.1;3.1.1 Performance;302
22.3.2;3.2 Recognition-Free Keyword Spotting;303
22.3.2.1;3.2.1 Performance;307
22.4;4 Conclusion;307
22.5;References;308
23;Indian Language Information Retrieval;309
23.1;1 Introduction;309
23.1.1;1.1 Background;311
23.2;2 Overview of Indian Language IR;311
23.2.1;2.1 Information Sources;311
23.2.2;2.2 Research Efforts;312
23.2.2.1;2.2.1 Text Retrieval;313
23.2.2.2;2.2.2 Information Extraction;316
23.2.2.3;2.2.3 Question Answering;317
23.2.2.4;2.2.4 Topic Detection and Tracking;317
23.2.2.5;2.2.5 Indian Language Subtrack at CLEF 2007;318
23.3;3 The CLIA Project;319
23.3.1;3.1 The Forum for Information Retrieval Evaluation (FIRE);320
23.4;4 Conclusion;320
23.5;References;321
24;Colour Plates;323
25;Index;329


