Clustering and Network Analysis with Single Nucleotide Polymorphism (SNP)

Chen, Hongyan
The Graduate School, Stony Brook University: Stony Brook, NY.
The goal of genome-wide association studies (GWAS) is to investigate the relationships between disease phenotypes and genotypes, which are usually determined by a large number of single nucleotide polymorphisms (SNPs). Currently, GWAS are often underpowered to identify SNPs with small to moderate effect sizes. To overcome this difficulty, two major approaches are often adopted: (1) meta-analysis, which increases the sample size, and (2) SNP pre-selection by dimension reduction. Dimension reduction for SNP data has been arduous because the categorical nature of SNPs renders most association measures, such as the Pearson correlation or the Euclidean distance, inappropriate. In this thesis, we propose a novel (partial) canonical correlation association measure for categorical data that can be used within major dimension reduction approaches, including cluster analysis (CA) and partial correlation network analysis (PCNA), for the analysis of GWAS data. Its performance is examined and compared with that of existing association measures. Network analysis methods such as PCNA and the Bayesian network serve not only as dimension reduction approaches but also as data-driven pathway discovery tools. A key objective in modern genetic studies is to discover the regulatory causal relationships between genetic mutations, measured by SNPs, and the resulting functional changes, often gauged by gene expression levels. With the former being categorical and the latter continuous numerical data, we face the problem of mixed data types. Our novel partial canonical correlation measure developed for categorical data can be readily extended to PCNA with mixed variables. This new approach is illustrated using a real data example from a study on inflammatory bowel diseases conducted at Stony Brook University Medical Center and Washington University in St. Louis. A comparison is also made with Bayesian network analysis for mixed data, and guidelines are provided on the pros and cons of each method.
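
To make the idea of a canonical correlation measure for categorical data concrete, the following is a minimal sketch and not the thesis's own procedure. It assumes genotypes are stored as integer codes (for example 0, 1, 2), indicator-codes each SNP, and takes the largest canonical correlation between the two indicator matrices. The function names `one_hot_drop_first` and `first_canonical_correlation` are hypothetical, and the partial (conditioned) version described in the abstract is not covered.

```python
import numpy as np


def one_hot_drop_first(codes):
    """Indicator-code a categorical vector, dropping the first observed level."""
    codes = np.asarray(codes)
    levels = np.unique(codes)
    # One column per level except the first, to avoid a redundant column.
    return (codes[:, None] == levels[None, 1:]).astype(float)


def first_canonical_correlation(x, y):
    """Largest canonical correlation between the indicator codings of x and y.

    Illustrative assumption: association between two categorical variables
    is summarized by the top canonical correlation of their dummy codings.
    """
    X = one_hot_drop_first(x)
    Y = one_hot_drop_first(y)
    if X.shape[1] == 0 or Y.shape[1] == 0:
        return 0.0  # a variable with a single level carries no association
    # Center the columns so the result is a correlation, not a raw cosine.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Orthonormal bases of the two column spaces.
    qx, _ = np.linalg.qr(X)
    qy, _ = np.linalg.qr(Y)
    # Singular values of qx^T qy are the canonical correlations.
    s = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(min(s[0], 1.0))  # clip tiny floating-point overshoot


if __name__ == "__main__":
    snp_a = [0, 1, 2, 0, 1, 2, 0, 1]
    snp_b = [2, 0, 1, 2, 0, 1, 2, 0]  # same grouping, labels permuted
    print(first_canonical_correlation(snp_a, snp_b))
```

Unlike the Pearson correlation computed on the raw codes, this quantity does not depend on how the genotype categories are numbered, which is one reason indicator-based measures suit categorical data.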
102 pg.