New Development in Cluster Analysis and Other Related Multivariate Analysis Methods

Zhang, Shaonan
The Graduate School, Stony Brook University: Stony Brook, NY.
Cluster analysis is a multivariate analysis method aimed at (1) unraveling the natural groupings embedded within the data, and (2) dimension reduction. With the wide application of cluster analysis in diverse modern research and business fields, including machine learning, bioinformatics, medical image analysis, pattern recognition, market research and global climate research, many clustering algorithms have been developed to date. However, novel or special circumstances continue to call for better-customized cluster analysis methods, which motivates this thesis. The thesis consists of two parts. In the first part, we extend modern multiple-objective cluster analysis from a single set of features to multiple distinct sets of features by developing the novel compound clustering method and the constrained clustering method. We also develop a new statistic, the "complete linkage" R², to be used alongside the well-known largest average silhouette in determining the optimal number of clusters in compound clustering. The novel compound and constrained clustering methods are illustrated through a gene microarray study with both gene expression data and gene function information. In the second part of this thesis, we propose a novel algorithm for weighted k-means clustering. Weighted k-means clustering is an extension of k-means clustering in which a set of nonnegative weights is assigned to the variables. We first derive the optimal variable weights for weighted k-means clustering in order to obtain more meaningful and interpretable clusters. We then improve the current weighted k-means clustering method (Huh and Lim 2009) by incorporating our novel algorithm, which obtains variable weights guaranteed to be globally optimal, based on the method of Lagrange multipliers and the Karush-Kuhn-Tucker conditions. We first present the related theoretical formulation and derivation of the optimal weights, and then provide an iterative computing algorithm to calculate them. Numerical examples on both simulated and well-known real data illustrate our method. The results show that our method outperforms the originally proposed method in terms of classification accuracy, stability and computational efficiency.
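To make the idea of variable weights concrete, the following is a minimal Python sketch of k-means with fixed, user-supplied nonnegative variable weights. It is not the algorithm of the thesis or of Huh and Lim (2009): it does not derive or optimize the weights, and the function names (`weighted_sq_dist`, `assign`, `weighted_kmeans`), the random initialization and the toy data are all assumptions made for illustration only.

```python
import random

def weighted_sq_dist(x, c, w):
    # Squared Euclidean distance with one nonnegative weight per variable.
    return sum(wj * (xj - cj) ** 2 for xj, cj, wj in zip(x, c, w))

def assign(X, centers, w):
    # Label each point with the index of its nearest center (weighted distance).
    return [min(range(len(centers)),
                key=lambda j: weighted_sq_dist(x, centers[j], w))
            for x in X]

def weighted_kmeans(X, weights, k, n_iter=100, seed=0):
    # Illustrative only: the weights are fixed inputs, not optimized.
    if any(wj < 0 for wj in weights):
        raise ValueError("variable weights must be nonnegative")
    rng = random.Random(seed)
    centers = [list(x) for x in rng.sample(X, k)]  # k distinct points as seeds
    for _ in range(n_iter):
        labels = assign(X, centers, weights)
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:
                # The per-variable mean minimizes the weighted squared distance.
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return assign(X, centers, weights), centers

# Toy data (hypothetical): variable 0 separates two groups, variable 1 is noise.
X = [[0.0, 5.0], [0.1, -5.0], [10.0, 5.0], [10.1, -5.0]]
labels, centers = weighted_kmeans(X, weights=[1.0, 0.0], k=2)
```

With all the weight on variable 0, the first two points end up in one cluster and the last two in the other; moving the weight to variable 1 groups the points by the second variable instead. This shows how strongly the choice of weights shapes the clusters.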
121 pg.