Annika Tillander
Doctoral thesis
Classification models for high-dimensional data with sparsity patterns
Abstract
Today's high-throughput data collection devices, e.g. spectrometers and gene chips, create information in abundance. However, this poses serious statistical challenges, as the number of features is usually much larger than the number of observed units. Further, in this high-dimensional setting, only a small fraction of the features are likely to be informative for any specific project. In this thesis, three different approaches to two-class supervised classification in this high-dimensional, low-sample-size setting are considered.
There are classifiers that are known to mitigate the issues of high dimensionality, e.g. distance-based classifiers such as Naive Bayes. However, these classifiers are often computationally intensive for continuous data and considerably less time-consuming for discrete data. Hence, continuous features are often transformed into discrete features. In the first paper, a discretization algorithm suitable for high-dimensional data is suggested and compared with other discretization approaches. Further, the effect of discretization on the misclassification probability in the high-dimensional setting is evaluated.
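To illustrate the kind of transformation involved, the sketch below discretizes a continuous feature by equal-frequency binning. This is a generic scheme for illustration only, not the algorithm proposed in paper I; the function name and bin count are assumptions.

```python
import numpy as np

def equal_frequency_discretize(x, n_bins=4):
    """Discretize a continuous feature into n_bins equally populated bins.

    A minimal equal-frequency scheme for illustration; the thesis
    proposes its own discretization algorithm for high-dimensional data.
    """
    # Interior bin edges at the empirical quantiles of x
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    edges = np.quantile(x, quantiles)
    # np.digitize maps each value to its bin index (0 .. n_bins - 1)
    return np.digitize(x, edges)
```

After such a transformation, each feature takes only a few discrete levels, so a Naive Bayes classifier can be fitted from simple cell counts instead of density estimates.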
Linear classifiers are more stable, which motivates adjusting the linear discriminant procedure to the high-dimensional setting. In the second paper, a two-stage estimation procedure for the inverse covariance matrix, applying Lasso-based regularization and Cuthill-McKee ordering, is suggested. The estimation yields a block-diagonal approximation of the covariance matrix, which in turn leads to an additive classifier. In the third paper, an asymptotic framework that represents sparse and weak block models is derived and a technique for block-wise feature selection is proposed.
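The block-discovery step can be sketched as follows, assuming a sparse precision matrix has already been estimated (e.g. with a graphical Lasso implementation such as scikit-learn's `GraphicalLasso`). Here a toy sparsity pattern stands in for a fitted estimate; the exact two-stage procedure of paper II is not reproduced.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee, connected_components

# Toy sparse precision matrix with two hidden blocks: {0, 2, 4} and {1, 3, 5}
p = 6
precision = np.eye(p)
for i, j in [(0, 2), (2, 4), (1, 3), (3, 5)]:
    precision[i, j] = precision[j, i] = 0.4

# Adjacency graph of the sparsity pattern (nonzero entries)
adj = csr_matrix((np.abs(precision) > 1e-6).astype(int))

# Reverse Cuthill-McKee gives a permutation that concentrates nonzeros
# near the diagonal; the connected components of the graph identify the
# diagonal blocks of the permuted matrix.
order = reverse_cuthill_mckee(adj, symmetric_mode=True)
n_blocks, labels = connected_components(adj, directed=False)
```

Once the blocks are identified, the covariance is block-diagonal up to a permutation, so the discriminant score decomposes into a sum of per-block terms, i.e. an additive classifier.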
Probabilistic classifiers have the advantage of providing, for a new observation, the probability of membership in each class rather than simply assigning it to a class. In the fourth paper, a method is developed for constructing a Bayesian predictive classifier. Given the block-diagonal covariance matrix, the resulting Bayesian predictive and marginal classifier provides an efficient solution to the high-dimensional problem by splitting it into smaller, tractable problems.
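The splitting idea can be illustrated with a plug-in Gaussian classifier: under a block-diagonal covariance, the joint log-density factorizes into a sum of small per-block Gaussian log-densities. This is a minimal sketch, not the Bayesian predictive rule of paper IV; the function and the toy block layout are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def block_posteriors(x, means, covs_by_block, blocks, priors):
    """Class posterior probabilities under a block-diagonal Gaussian model.

    Each block contributes an independent low-dimensional Gaussian
    log-density, so the high-dimensional problem splits into small ones.
    """
    log_post = np.log(priors).astype(float)
    for c in range(len(priors)):
        for b, idx in enumerate(blocks):
            log_post[c] += multivariate_normal.logpdf(
                x[idx], mean=means[c][idx], cov=covs_by_block[b])
    # Normalize on the log scale for numerical stability
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

# Toy usage: two classes, three features in blocks {0, 1} and {2}
blocks = [np.array([0, 1]), np.array([2])]
covs = [np.eye(2), np.array([[1.0]])]
means = [np.zeros(3), np.full(3, 2.0)]
post = block_posteriors(np.zeros(3), means, covs, blocks, [0.5, 0.5])
```

The observation at the first class mean receives nearly all the posterior mass for that class, and each class's score was computed from 2x2 and 1x1 problems rather than one 3x3 problem.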
The relevance and benefits of the proposed methods are illustrated using both simulated and real data.
Keywords: High-dimensionality, supervised classification, classification accuracy, sparsity, block-diagonal covariance structure, graphical Lasso, separation strength, discretization
ISBN 978-91-7447-772-6
Download Summarising chapter >>
Download paper I >>
Effect of data discretization on the classification accuracy in a high-dimensional framework.
Download paper II >>
Covariance structure approximation via gLasso in high-dimensional supervised classification.
Download paper III >>
Empirical evaluation of sparse classification boundaries and HC-feature thresholding in high-dimensional data.
Download paper IV >>
Bayesian BlockDiagonal Predictive Classifier for Gaussian Data
