E-book, English, Volume 11, 304 pages
Series: Computational Biology
Axelson-Fisk: Comparative Gene Finding
Models, Algorithms and Implementation
1st edition, 2010
ISBN: 978-1-84996-104-2
Publisher: Springer
Format: PDF
Copy protection: PDF watermark
Comparative genomics is a new and emerging field, and with the explosion of available biological sequences the demand for faster, more efficient and more robust algorithms to analyze all these data is immense. This book is meant to serve as a self-contained introduction to the state of the art of computational gene finding in general and of comparative approaches in particular. It gives an overview of the various methods that have been applied in the field and a quick introduction to how computational gene finders are built. A beginner to the field can use this book as a guide to the main points to consider when constructing a gene finder and to the main algorithms in use, while the more experienced reader can use it as a reference to the different methods and to the main components incorporated in them. I have focused on the main uses of the covered methods and avoided much of the technical detail and the general extensions of the models. In exchange I have tried to supply references to more detailed accounts of the different research areas touched upon. The book, however, makes no claim to being comprehensive.
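To give a flavor of the kind of algorithm the book builds its gene finders on, here is a minimal sketch of Viterbi decoding for a toy two-state HMM. The "exon"/"intron" states and all probabilities are hypothetical illustration values, not taken from the book.

```python
# Minimal Viterbi decoding for a toy 2-state HMM (illustrative sketch;
# states and probabilities are hypothetical, not from the book).

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for an observation sequence."""
    # V[t][s]: probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor state r for arriving in s at time t
            prob, prev = max(
                (V[t - 1][r] * trans_p[r][s] * emit_p[s][obs[t]], r)
                for r in states
            )
            V[t][s] = prob
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy model: "exon"/"intron" states emitting nucleotides
states = ("exon", "intron")
start_p = {"exon": 0.5, "intron": 0.5}
trans_p = {"exon": {"exon": 0.9, "intron": 0.1},
           "intron": {"exon": 0.1, "intron": 0.9}}
emit_p = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
          "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

print(viterbi("GCGCAT", states, start_p, trans_p, emit_p))
```

Real gene finders (Chaps. 6 and 8 of the book) extend this dynamic-programming core with generalized state durations, pairwise alignment states, and richer emission models.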
Authors/Editors
More Info & Material
1  Preface  7
2  Acknowledgments  9
3  Contents  10
4  Acronyms  14
5  Introduction  15
5.1  Some Basic Genetics  15
5.2  The Central Dogma  17
5.3  The Structure of a Gene  19
5.4  How Many Genes Do We Have?  21
5.5  Problems of Gene Definitions  25
5.6  The Gene Finding Problem  26
5.7  Comparative Gene Finding  28
5.8  History of Algorithm Development  29
5.9  To Build a Gene Finder  32
5.10  References  35
6  Single Species Gene Finding  41
6.1  Hidden Markov Models (HMMs)  41
6.1.1  Markov Chains  42
6.1.1.1  Discrete-Time Markov Chains  42
6.1.1.2  Stationarity and Reversibility  48
6.1.1.3  Continuous-Time Markov Chains  50
6.1.2  Hidden Markov Models  53
6.1.3  Dynamic Programming  56
6.1.3.1  Silent Begin and End States  58
6.1.4  The Forward Algorithm  59
6.1.5  The Backward Algorithm  59
6.1.6  The Viterbi Algorithm  61
6.1.7  EasyGene: A Prokaryotic Gene Finder  63
6.1.7.1  Posterior Decoding  65
6.1.7.2  Statistical Significance of Predictions  65
6.2  Generalized Hidden Markov Models (GHMMs)  66
6.2.1  Preliminaries  66
6.2.2  The Forward and Backward Algorithms  68
6.2.2.1  The Forward Variables  68
6.2.2.2  The Backward Variables  70
6.2.3  The Viterbi Algorithm  70
6.2.4  Genscan: A GHMM-Based Gene Finder  71
6.2.4.1  Sequence Generation Algorithm  74
6.2.4.2  Reducing Computational Complexity  74
6.2.4.3  Exon Probabilities  78
6.3  Interpolated Markov Models (IMMs)  81
6.3.1  Preliminaries  81
6.3.2  Linear and Rational Interpolation  82
6.3.3  GLIMMER: A Microbial Gene Finder  83
6.3.3.1  Gene Prediction  84
6.3.3.2  Training the IMM  85
6.3.3.3  GlimmerM  86
6.4  Neural Networks  86
6.4.1  Biological Neurons  87
6.4.2  Artificial Neurons and the Perceptron  88
6.4.3  Multi-Layer Neural Networks  90
6.4.4  GRAIL: A Neural Network-Based Gene Finder  91
6.5  Decision Trees  93
6.5.1  Classification  94
6.5.2  Decision Tree Learning  95
6.5.3  MORGAN: A Decision Tree-Based Gene Finder  99
6.6  References  100
7  Sequence Alignment  103
7.1  Pairwise Sequence Alignment  103
7.1.1  Dot Plot Matrix  105
7.1.2  Nucleotide Substitution Models  106
7.1.2.1  The Jukes-Cantor Model  108
7.1.2.2  The Kimura Model  109
7.1.2.3  The Felsenstein Model  110
7.1.2.4  The Tamura and Nei Model  111
7.1.2.5  General Time-Reversible (GTR) Model  111
7.1.3  Amino Acid Substitution Models  112
7.1.3.1  The PAM Matrix  113
7.1.3.2  The BLOSUM Matrix  117
7.1.3.3  The GONNET Matrix  120
7.1.4  Gap Models  120
7.1.5  The Needleman-Wunsch Algorithm  122
7.1.5.1  Needleman-Wunsch Using Affine Gaps  124
7.1.6  The Smith-Waterman Algorithm  126
7.1.7  Pair Hidden Markov Models (PHMMs)  128
7.1.7.1  Preliminaries  128
7.1.7.2  The Forward, Backward, and Viterbi Algorithms  130
7.1.8  Database Similarity Searches  132
7.1.8.1  FASTA  132
7.1.8.2  BLAST  134
7.1.8.3  Gapped BLAST  136
7.1.8.4  PSI-BLAST  136
7.1.9  The Significance of Alignment Scores  137
7.2  Multiple Sequence Alignment  138
7.2.1  Scoring Schemes  139
7.2.1.1  Sum-of-Pairs (SP)  141
7.2.1.2  Weighted Sum-of-Pairs (WSP)  141
7.2.1.3  Minimum Entropy  141
7.2.1.4  Gap Costs  142
7.2.2  Phylogenetic Trees  143
7.2.2.1  The Neighbor-Joining Method  143
7.2.2.2  Fitch-Margoliash  144
7.2.3  Dynamic Programming  145
7.2.3.1  The MSA Package  145
7.2.4  Progressive Alignments  147
7.2.5  Iterative Methods  150
7.2.6  Hidden Markov Models  153
7.2.6.1  SAM: Sequence Alignment and Modeling  153
7.2.7  Genetic Algorithms  155
7.2.8  Simulated Annealing  158
7.2.9  Alignment Profiles  161
7.2.9.1  Standard Profiles  161
7.2.9.2  Profile HMMs  163
7.2.9.3  Scoring a New Sequence  164
7.3  References  165
8  Comparative Gene Finding  171
8.1  Similarity-Based Gene Finding  171
8.1.1  GenomeScan: GHMM-Based Gene Finding Using Homology  172
8.1.2  Twinscan: GHMM-Based Gene Finding Using Informant Sequences  174
8.2  Heuristic Cross-Species Gene Finding  176
8.2.1  ROSETTA  176
8.3  Pair Hidden Markov Models (PHMMs)  177
8.3.1  DoubleScan: A PHMM-Based Comparative Gene Finder  178
8.3.1.1  The State Space  178
8.3.1.2  The Stepping Stone Algorithm  180
8.4  Generalized Pair Hidden Markov Models (GPHMMs)  181
8.4.1  Preliminaries  181
8.4.1.1  The Forward, Backward and Viterbi Algorithms  182
8.4.2  SLAM: A GPHMM-Based Comparative Gene Finder  184
8.4.2.1  The State Space  184
8.4.2.2  Reducing Computational Complexity  186
8.5  Gene Mapping  188
8.5.1  Projector: A Gene Mapping Tool  188
8.5.2  GeneMapper: Reference-Based Annotation  189
8.6  Multiple Sequence Gene Finding  190
8.6.1  N-SCAN: A Multiple Informant-Based Gene Finder  191
8.7  References  193
9  Gene Structure Submodels  195
9.1  The State Space  195
9.1.1  The Exon States  196
9.1.2  Splice Sites  198
9.1.3  Introns and Intergenic Regions  199
9.1.4  Untranslated Regions (UTRs)  200
9.1.5  Promoters and PolyA-Signals  201
9.2  State Length Distributions  202
9.2.1  Geometric and Negative Binomial Lengths  202
9.2.2  Empirical Length Distributions  205
9.2.3  Acyclic Discrete Phase Type Distributions  206
9.3  Sequence Content Sensors  210
9.3.1  GC-Content Binning  210
9.3.2  Start Codon Recognition  211
9.3.3  Codon and Amino Acid Usage  212
9.3.4  K-Tuple Frequency Analysis  214
9.3.5  Markov Chain Content Sensors  215
9.3.6  Interpolated Markov Models  217
9.4  Splice Site Detection  218
9.4.1  Weight Matrices and Weight Array Models  218
9.4.2  Variable-Length Markov Models (VLMMs)  221
9.4.3  Maximal Dependence Decomposition (MDD)  223
9.4.3.1  The Position with the Strongest Influence  224
9.4.3.2  Score a New Sequence  228
9.4.4  Neural Networks  229
9.4.5  Linear Discriminant Analysis  231
9.4.5.1  Quadratic Discriminant Analysis (QDA)  231
9.4.5.2  Linear Discriminant Analysis (LDA)  232
9.4.6  Maximum Entropy  235
9.4.6.1  The Maximum Entropy Method  236
9.4.6.2  Application to Splice Site Detection  239
9.4.6.3  Iterative Scaling  240
9.4.7  Bayesian Networks  241
9.4.7.1  Preliminaries  241
9.4.7.2  Some Bayesian Theory  242
9.4.7.3  Training a Bayesian Network  245
9.4.7.4  Application to Splice Site Detection  246
9.4.8  Support Vector Machines  247
9.4.8.1  Linearly Separable Classes  248
9.4.8.2  Nearly Linear SVMs  251
9.4.8.3  Nonlinear SVMs  251
9.4.8.4  SVMs in Splice Site Detection  254
9.5  References  256
10  Parameter Training  259
10.1  Introduction  259
10.2  Pseudocounts  260
10.2.1  The SAM Regularizer  261
10.3  Maximum Likelihood Estimation  262
10.3.1  HMM Training on Labeled Sequences  265
10.4  The Expectation-Maximization (EM) Algorithm  268
10.5  The Baum-Welch Algorithm  275
10.5.1  The Forward-Backward Algorithm  275
10.5.2  The Baum-Welch Algorithm  277
10.6  Gradient Ascent/Descent  279
10.7  The Backpropagation Algorithm  282
10.7.1  The Feed-Forward Step  284
10.7.2  The Backpropagation Step  286
10.7.3  The Gradient Descent Step  287
10.7.4  Several Training Patterns  287
10.8  Discriminative Training  288
10.8.1  Conditional Maximum Likelihood  288
10.8.2  Maximum Mutual Information  289
10.8.3  Minimum Classification Error  290
10.9  Gibbs Sampling  292
10.9.1  Gibbs Sampling for HMM Training  293
10.10  Simulated Annealing  294
10.10.1  Simulated Annealing for Training of HMMs  297
10.11  References  297
11  Implementation of a Comparative Gene Finder  299
11.1  Program Structure  299
11.1.1  Command Line Arguments  300
11.1.2  Parameter Files  302
11.1.3  Candidate Exon Boundaries  304
11.1.4  Output Files  305
11.2  The GPHMM Model  306
11.2.1  Modeling Intron and Intergenic Pairs  306
11.2.2  Modeling Exon Pairs  308
11.2.3  Approximate Alignment  309
11.3  Accuracy Assessment  310
11.4  Possible Model Extensions  311
11.5  References  312
12  Index  313