Jessica Franzén

-- licentiate thesis --

On Cluster Analysis - A Bayesian and Model-Based Approach

Cluster analysis is the automated search for homogeneous and cohesive groups in a given data set. Traditional cluster analysis is based on deterministic methods, which use measures between pairs of objects, and between objects and centroids, to create well-separated groups. Despite considerable research, there is little guidance on practical questions such as how many clusters there are and how to handle outlier objects. A model-based approach to cluster analysis is presented. As opposed to the mechanical classification used in deterministic clustering, we regard observations as outcomes of different distributions. A finite mixture model is used, where each probability distribution corresponds to a cluster. This approach opens up new possibilities. The model can handle groups of different sizes, shapes, and directions by allowing for different distributions and parametrizations across clusters. In reality, clusters are seldom well separated. The method handles overlapping groups by taking cluster membership probabilities into account in the regions of overlap. Many data sets contain objects that are not suitable for classification. A distinctive feature of this thesis is the creation of a deviant cluster of larger variance, consisting of these outlier objects. Bayesian inference via Gibbs sampling is used to estimate the distribution parameters and the proportions between clusters. The method is tested on simulated and real data sets and shows promising results. Model selection by an approximation of Bayes factors is applied, with the purpose of selecting the number of clusters and deciding whether a deviant group should be included in the model.
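
As a rough illustration of the kind of procedure described above, the following is a minimal sketch of a Gibbs sampler for a one-dimensional Gaussian mixture. It is not the implementation used in the thesis. The function name `gibbs_mixture`, the priors (a normal prior on each cluster mean, a symmetric Dirichlet prior on the proportions), and the simplification that the cluster standard deviations are known and fixed are all assumptions made for this example. Under these assumptions, a component given a larger standard deviation could play the role of a deviant cluster.

```python
import numpy as np

def gibbs_mixture(x, sigmas, n_iter=500, m0=0.0, s0=10.0, alpha=1.0, seed=0):
    """Gibbs sampler for a 1-D Gaussian mixture with known standard deviations.

    Assumed priors (illustrative only): mu_k ~ N(m0, s0^2),
    proportions ~ Dirichlet(alpha, ..., alpha).
    Returns draws of the means, draws of the proportions, and the
    membership probabilities from the last iteration.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    K = sigmas.size

    mu = rng.choice(x, size=K, replace=False)  # start the means at data points
    w = np.full(K, 1.0 / K)                    # start with equal proportions
    mu_draws = np.empty((n_iter, K))
    w_draws = np.empty((n_iter, K))

    for t in range(n_iter):
        # Step 1: membership probabilities, p[i, k] proportional to
        # w_k * N(x_i | mu_k, sigma_k^2), computed on the log scale.
        log_p = np.log(w) - np.log(sigmas) - 0.5 * ((x[:, None] - mu) / sigmas) ** 2
        log_p -= log_p.max(axis=1, keepdims=True)
        p = np.exp(log_p)
        p /= p.sum(axis=1, keepdims=True)

        # Draw a cluster label for each observation by inverting the CDF.
        u = rng.random(x.size)
        z = (p.cumsum(axis=1) < u[:, None]).sum(axis=1)
        z = np.minimum(z, K - 1)  # guard against floating-point rounding

        counts = np.bincount(z, minlength=K)
        sums = np.bincount(z, weights=x, minlength=K)

        # Step 2: proportions given labels (Dirichlet is conjugate).
        w = rng.dirichlet(alpha + counts)

        # Step 3: means given labels (normal prior is conjugate).
        post_var = 1.0 / (1.0 / s0**2 + counts / sigmas**2)
        post_mean = post_var * (m0 / s0**2 + sums / sigmas**2)
        mu = rng.normal(post_mean, np.sqrt(post_var))

        mu_draws[t] = mu
        w_draws[t] = w

    return mu_draws, w_draws, p
```

The returned membership probabilities give a soft classification, which is what allows overlapping groups to be handled. Component labels can switch between iterations, so draws should be put in a common order (for example, sorted by mean) before they are summarized.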

Download Introduction and Summary of Reports

Download report 1: Bayesian Inference for a Mixture Model using Gibbs Sampler

Download report 2: Model-Based Cluster Analysis - Classification of Twelve Year Old Children with a Deviant Group