Detailed Table of Contents

List of Code Fragments ix

Part one: Basic Concepts 1

1 Pattern Analysis 3
1.1 Patterns in Data 4
1.1.1 Data 4
1.1.2 Patterns 5
1.2 Pattern Analysis Algorithms 12
1.2.1 Statistical Stability of Patterns 13
1.2.2 Detecting Patterns by Recoding 15
1.3 Exploiting Patterns 17
1.3.1 The Overall Strategy 18
1.3.2 Common Pattern Analysis Tasks 19
1.4 Summary 22
1.5 Further Reading and Advanced Topics 23

2 Kernel Methods: an Overview 25
2.1 The Overall Picture 26
2.2 Linear Regression in a Feature Space 27
2.2.1 Primal Linear Regression 27
2.2.2 Ridge Regression: Primal and Dual 30
2.2.3 Kernel-Defined Nonlinear Feature Mappings 32
2.3 Other Examples 36
2.3.1 Algorithms 36
2.3.2 Kernels 39
2.4 The Modularity of Kernel Methods 41
2.5 Roadmap of the Book 42
2.6 Summary 43
2.7 Further Reading and Advanced Topics 44

3 Properties of Kernels 46
3.1 Inner Products and Positive Semi-Definite Matrices 47
3.1.1 Hilbert Spaces 47
3.1.2 Gram Matrix 51
3.2 Characterisation of Kernels 59
3.3 The Kernel Matrix 67
3.4 Kernel Construction 73
3.4.1 Operations on Kernel Functions 73
3.4.2 Operations on Kernel Matrices 78
3.5 Summary 81
3.6 Further Reading and Advanced Topics 81

4 Detecting Stable Patterns 83
4.1 Concentration Inequalities 84
4.2 Capacity and Regularisation: Rademacher Theory 91
4.3 Pattern Stability for Kernel-Based Classes 95
4.4 A Pragmatic Approach 102
4.5 Summary 103
4.6 Further Reading and Advanced Topics 104

Part two: Pattern Analysis Algorithms 107

5 Elementary Algorithms in Feature Space 109
5.1 Means and Distances 110
5.1.1 A Simple Algorithm for Novelty-Detection 114
5.1.2 A Simple Algorithm for Classification 118
5.2 Computing Projections: Gram–Schmidt, QR and Cholesky 119
5.3 Measuring the Spread of the Data 127
5.4 Fisher Discriminant Analysis I 130
5.5 Summary 135
5.6 Further Reading and Advanced Topics 135

6 Pattern Analysis using Eigen-Decompositions 137
6.1 Singular Value Decomposition 138
6.2 Principal Components Analysis 140
6.2.1 Kernel Principal Components Analysis 147
6.2.2 Stability of Principal Components Analysis 148
6.3 Directions of Maximum Covariance 152
6.4 The Generalised Eigenvector Problem 158
6.5 Canonical Correlation Analysis 161
6.6 Fisher Discriminant Analysis II 172
6.7 Methods for Linear Regression 173
6.7.1 Partial Least Squares 177
6.7.2 Kernel Partial Least Squares 183
6.8 Summary 188
6.9 Further Reading and Advanced Topics 189

7 Pattern Analysis using Convex Optimisation 191
7.1 The Smallest Enclosing Hypersphere 192
7.1.1 The Smallest Hypersphere Containing a Set of Points 193
7.1.2 Stability of Novelty-Detection 196
7.1.3 Hyperspheres Containing Most of the Points 198
7.2 Support Vector Machines for Classification 207
7.2.1 The Maximal Margin Classifier 208
7.2.2 Soft Margin Classifiers 215
7.3 Support Vector Machines for Regression 226
7.3.1 Stability of Regression 227
7.3.2 Ridge Regression 228
7.3.3 ε-Insensitive Regression 230
7.4 On-line Classification and Regression 236
7.5 Summary 244
7.6 Further Reading and Advanced Topics 246

8 Ranking, Clustering and Data Visualisation 248
8.1 Discovering Rank Relations 249
8.1.1 Batch Ranking 251
8.1.2 On-line Ranking 256
8.2 Discovering Cluster Structure in a Feature Space 260
8.2.1 Measuring Cluster Quality 261
8.2.2 Greedy Solution: k-means 267
8.2.3 Relaxed Solution: Spectral Methods 270
8.3 Data Visualisation 275
8.4 Summary 280
8.5 Further Reading and Advanced Topics 281

Part three: Constructing Kernels 283

9 Basic Kernels and Kernel Types 285
9.1 Kernels in Closed Form 286
9.2 ANOVA Kernels 291
9.3 Kernels from Graphs 298
9.4 Diffusion Kernels on Graph Nodes 304
9.5 Kernels on Sets 308
9.6 Kernels on Real Numbers 311
9.7 Randomised Kernels 314
9.8 Other Kernel Types 316
9.8.1 Kernels from Successive Embeddings 316
9.8.2 Kernels over General Structures 317
9.8.3 Kernels from Generative Information 318
9.9 Summary 318
9.10 Further Reading and Advanced Topics 319

10 Kernels for Text 321
10.1 From Bag of Words to Semantic Space 322
10.1.1 Representing Text 322
10.1.2 Semantic Issues 324
10.2 Vector Space Kernels 325
10.2.1 Designing Semantic Kernels 327
10.2.2 Designing the Proximity Matrix 329
10.3 Summary 335
10.4 Further Reading and Advanced Topics 336

11 Kernels for Structured Data: Strings, Trees, etc. 338
11.1 Comparing Strings and Sequences 339
11.2 Spectrum Kernels 341
11.3 All-Subsequences Kernels 345
11.4 Fixed Length Subsequences Kernels 351
11.5 Gap-Weighted Subsequences Kernels 354
11.5.1 Naive Implementation 356
11.5.2 Efficient Implementation 360
11.5.3 Variations on the Theme 361
11.6 Beyond Dynamic Programming: Trie-Based Kernels 366
11.6.1 Trie Computation of the p-Spectrum Kernels 368
11.6.2 Trie-Based Mismatch Kernels 370
11.6.3 Trie-Based Restricted Gap-Weighted Kernels 373
11.7 Kernels for Structured Data 376
11.7.1 Comparing Trees 377
11.7.2 Structured Data: a Framework 384
11.8 Summary 389
11.9 Further Reading and Advanced Topics 389

12 Kernels from Generative Models 391
12.1 P-Kernels 392
12.1.1 Conditional-Independence (CI) and Marginalisation 394
12.1.2 Representing Multivariate Distributions 395
12.1.3 Fixed Length Strings Generated by a Hidden Binomial Model 397
12.1.4 Fixed Length Strings Generated by a Hidden Markov Model 399
12.1.5 Pair Hidden Markov Model Kernels 403
12.1.6 Hidden Tree Model Kernels 409
12.2 Fisher Kernels 414
12.2.1 From Probability to Geometry 414
12.2.2 Fisher Kernels for Hidden Markov Models 422
12.3 Summary 428
12.4 Further Reading and Advanced Topics 428

Part four: Appendices 429

Appendix A Proofs Omitted from the Main Text 431
Appendix B Notational Conventions 438
Appendix C List of Pattern Analysis Methods 440
Appendix D List of Kernels 442

Bibliography 445