
Annika Tillander

-- licentiate thesis --

Supervised Classification in a High-Dimensional Framework

Modern data collection generates high-dimensional data, and many traditional statistical methods are not applicable in such settings. The topic of this thesis is supervised classification for high-dimensional data.

In the first paper we consider high-dimensional estimation of the inverse covariance matrix and embed it into high-dimensional classification. We propose a two-stage algorithm which first recovers the structural zeros of the inverse covariance matrix and then enforces block sparsity by moving the nonzeros closer to the main diagonal. The block-diagonal approximation of the inverse covariance matrix is shown to lead to an additive classifier, and we demonstrate that accounting for this structure can improve classification accuracy; it also suggests variable selection at the block level. The properties of the procedure are investigated in growing-dimension asymptotics, and the effect of the block size on classification is explored. Lower and upper bounds for the fraction of separative blocks are established, and constraints are specified under which reliable classification with block-wise feature selection can be performed. We illustrate the benefits of the proposed approach on both simulated and real data.
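The two-stage idea (recover the sparsity pattern of the precision matrix, group features into blocks, then classify additively block by block) can be sketched roughly as follows. This is an illustrative assumption-laden sketch, not the thesis's exact algorithm: the toy data, the thresholding of the graphical Lasso estimate, and the use of connected components as the block grouping are all choices made here for demonstration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Toy two-class Gaussian data: p features, classes differ in the first 5 means.
n, p = 200, 20
mu0 = np.zeros(p)
mu1 = np.concatenate([np.full(5, 0.8), np.zeros(p - 5)])
X0 = rng.normal(mu0, 1.0, size=(n, p))
X1 = rng.normal(mu1, 1.0, size=(n, p))
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)

# Stage 1: estimate a sparse precision (inverse covariance) matrix with the
# graphical Lasso on pooled, class-centered data and read off its zero pattern.
pooled = np.vstack([X0 - m0, X1 - m1])
gl = GraphicalLasso(alpha=0.3).fit(pooled)
support = np.abs(gl.precision_) > 1e-4

# Stage 2 (one possible grouping): treat the support as an adjacency graph and
# take its connected components as the diagonal blocks.
n_blocks, block_of = connected_components(csr_matrix(support), directed=False)

def additive_score(x):
    """Sum of per-block linear discriminant scores; > 0 votes for class 0."""
    s = 0.0
    for b in range(n_blocks):
        idx = block_of == b
        omega_b = gl.precision_[np.ix_(idx, idx)]   # diagonal block of precision
        w = omega_b @ (m0[idx] - m1[idx])           # block-wise LDA direction
        s += (x[idx] - 0.5 * (m0[idx] + m1[idx])) @ w
    return s
```

By construction the class means are scored correctly here: each diagonal block of a positive-definite precision matrix is itself positive definite, so the per-block discriminant scores of the two estimated means have opposite signs.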

In the second paper we consider classification methods that do not rely on the inverse covariance matrix but are computationally intensive. Discretizing the continuous variables reduces the computational time, although at the cost of some information. We investigate how this trade-off affects misclassification in a high-dimensional framework. We propose a discretization algorithm that optimizes classification performance and compare it with other discretization methods as well as with results for continuous data. Our method performs well on both simulated and real data, and we show empirically that, for high-dimensional data, the misclassification rate is of the same magnitude, or even lower, when the continuous feature variables are first discretized.
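As a hedged illustration of this trade-off (not the thesis's optimized discretization algorithm), one can compare a Gaussian model on the raw continuous features against a categorical model on quantile-binned versions of the same features; the synthetic data, the number of bins, and the naive Bayes classifiers are all illustrative choices made here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic data with many features, only some of which carry signal.
X, y = make_classification(n_samples=400, n_features=100, n_informative=20,
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)

# Baseline: naive Bayes fitted directly on the continuous features.
acc_cont = GaussianNB().fit(Xtr, ytr).score(Xte, yte)

# Discretize each feature into 4 quantile bins (one simple, generic scheme),
# then fit a categorical naive Bayes on the integer bin codes.
disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
Ztr = disc.fit_transform(Xtr).astype(int)
Zte = disc.transform(Xte).astype(int)
acc_disc = CategoricalNB(min_categories=4).fit(Ztr, ytr).score(Zte, yte)

print(f"continuous: {acc_cont:.3f}  discretized: {acc_disc:.3f}")
```

Which variant wins depends on the data; the point of the sketch is only that binning keeps the pipeline cheap (per-feature count tables instead of continuous density estimates) while the bin codes retain much of the class-relevant information.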

Keywords: High dimensionality, supervised classification, classification accuracy, sparse, block-diagonal covariance structure, graphical Lasso, separation strength, discretization.

Download: Summarising chapter
Download: Paper I, "Covariance structure approximation via gLasso in high-dimensional supervised classification"
Download: Paper II, "Effect of data discretization on the classification accuracy in a high-dimensional framework"