
Annika Tillander

-- licentiate thesis --

Supervised Classification in a High-Dimensional Framework

Modern data collection generates high-dimensional data, and many traditional statistical methods are not applicable in such settings. The topic of this thesis is supervised classification for high-dimensional data.

In the first paper we consider high-dimensional estimation of the inverse covariance matrix and embed it into high-dimensional classification. We propose a two-stage algorithm which first recovers the structural zeros of the inverse covariance matrix and then enforces block sparsity by moving the nonzeros closer to the main diagonal. The block-diagonal approximation of the inverse covariance matrix is shown to lead to an additive classifier, and we demonstrate that accounting for this structure can improve classification accuracy; it also suggests variable selection at the block level. The properties of the procedure are investigated in growing-dimension asymptotics, and the effect of the block size on classification is explored. Lower and upper bounds for the fraction of separative blocks are established, and constraints are specified under which reliable classification with block-wise feature selection can be performed. We illustrate the benefits of the proposed approach on both simulated and real data.
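The two-stage idea (recover the sparsity pattern of the precision matrix, group features into blocks, then classify additively block by block) can be sketched roughly as follows. This is an illustrative assumption-laden sketch, not the thesis's exact algorithm: the toy data, the thresholding of the graphical Lasso estimate, and the use of connected components as the block grouping are all choices made here for demonstration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Toy two-class Gaussian data: p features, classes differ in the first 5 means.
n, p = 200, 20
mu0 = np.zeros(p)
mu1 = np.concatenate([np.full(5, 0.8), np.zeros(p - 5)])
X0 = rng.normal(mu0, 1.0, size=(n, p))
X1 = rng.normal(mu1, 1.0, size=(n, p))
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)

# Stage 1: estimate a sparse precision (inverse covariance) matrix with the
# graphical Lasso on pooled, class-centered data and read off its zero pattern.
pooled = np.vstack([X0 - m0, X1 - m1])
gl = GraphicalLasso(alpha=0.3).fit(pooled)
support = np.abs(gl.precision_) > 1e-4

# Stage 2 (one possible grouping): treat the support as an adjacency graph and
# take its connected components as the diagonal blocks.
n_blocks, block_of = connected_components(csr_matrix(support), directed=False)

def additive_score(x):
    """Sum of per-block linear discriminant scores; > 0 votes for class 0."""
    s = 0.0
    for b in range(n_blocks):
        idx = block_of == b
        omega_b = gl.precision_[np.ix_(idx, idx)]   # diagonal block of precision
        w = omega_b @ (m0[idx] - m1[idx])           # block-wise LDA direction
        s += (x[idx] - 0.5 * (m0[idx] + m1[idx])) @ w
    return s
```

By construction the class means are scored correctly here: each diagonal block of a positive-definite precision matrix is itself positive definite, so the per-block discriminant scores of the two estimated means have opposite signs.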

In the second paper we consider classification methods that do not rely on the inverse covariance matrix but are computationally intensive. Discretizing the continuous variables reduces the computational time, although at the cost of some information. We investigate how this trade-off affects misclassification in a high-dimensional framework. We propose a discretization algorithm that optimizes classification performance and compare it with other discretization methods as well as with results for continuous data. Our method performs well on both simulated and real data, and we show empirically that, for high-dimensional data, the misclassification rate is of the same magnitude, or even lower, when the continuous feature variables are first discretized.
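As a hedged illustration of this trade-off (not the thesis's optimized discretization algorithm), one can compare a Gaussian model on the raw continuous features against a categorical model on quantile-binned versions of the same features; the synthetic data, the number of bins, and the naive Bayes classifiers are all illustrative choices made here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic data with many features, only some of which carry signal.
X, y = make_classification(n_samples=400, n_features=100, n_informative=20,
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)

# Baseline: naive Bayes fitted directly on the continuous features.
acc_cont = GaussianNB().fit(Xtr, ytr).score(Xte, yte)

# Discretize each feature into 4 quantile bins (one simple, generic scheme),
# then fit a categorical naive Bayes on the integer bin codes.
disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
Ztr = disc.fit_transform(Xtr).astype(int)
Zte = disc.transform(Xte).astype(int)
acc_disc = CategoricalNB(min_categories=4).fit(Ztr, ytr).score(Zte, yte)

print(f"continuous: {acc_cont:.3f}  discretized: {acc_disc:.3f}")
```

Which variant wins depends on the data; the point of the sketch is only that binning keeps the pipeline cheap (per-feature count tables instead of continuous density estimates) while the bin codes retain much of the class-relevant information.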

Keywords: High dimensionality, supervised classification, classification accuracy, sparse, block-diagonal covariance structure, graphical Lasso, separation strength, discretization.

Download: Summarising chapter
Download: Paper I, "Covariance structure approximation via gLasso in high-dimensional supervised classification"
Download: Paper II, "Effect of data discretization on the classification accuracy in a high-dimensional framework"