Statistical Methods for Biological Pathway Analysis

Wu, Xiao
The Graduate School, Stony Brook University: Stony Brook, NY.
This thesis presents a novel theoretical development and a novel application of the structural equation modeling (SEM) framework, for comparing biological pathways and biological measurement platforms, respectively. For the methodological development, we extend the covariate structural equation modeling (cSEM) method for pathway comparisons (Sharpe, 2010), which was limited to continuous variables as pathway nodes and categorical variables as pathway covariates, to allow both continuous and categorical variables as pathway nodes and as pathway covariates. This novel mixed-variable cSEM method lets researchers build a pathway whose nodes include both continuous variables, such as gene expression levels, and categorical variables, such as genotypes; compare the pathway between groups (diseased, normal, etc.); and evaluate the impact of continuous covariates such as age on the pathway links (i.e., the connection patterns and strengths).
Culture-independent phylogenetic analysis of 16S ribosomal RNA gene sequences has emerged as an incisive method for identifying the bacteria present in a specimen. However, multiple competing measurement platforms are often available to enumerate bacterial abundances, including Sanger sequencing, pyrosequencing, and quantitative PCR. Here we present a novel application of the latent variable SEM to estimate the reliabilities of, and the similarities between, the different measurement platforms and, subsequently, to weight these measures optimally for a unified analysis of the true latent microbiome composition. The latent variable SEM contains the usual repeated measures ANCOVA as a special case and, as a more general, realistic, and optimal model, offers superior goodness-of-fit as well as more reliable analysis results.
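To make the idea of platform reliabilities and weights concrete, here is a minimal sketch of a one-factor measurement model with three platforms. It is an illustration under stated assumptions, not the estimation procedure used in the thesis. It assumes each platform reading is `x_j = lam_j * eta + eps_j`, with a single latent abundance `eta` of unit variance, positive loadings, and uncorrelated errors. The function name `one_factor_from_three` and all numeric values are hypothetical. With exactly three indicators the loadings have a closed form in terms of the pairwise covariances.

```python
import numpy as np

def one_factor_from_three(cov):
    """Closed-form one-factor estimates for exactly three indicators.

    Assumed model (illustrative only): x_j = lam_j * eta + eps_j,
    Var(eta) = 1, lam_j > 0, errors mutually uncorrelated.
    """
    s12, s13, s23 = cov[0, 1], cov[0, 2], cov[1, 2]
    # Under the model, cov[j, k] = lam_j * lam_k for j != k,
    # so lam_1^2 = s12 * s13 / s23, and similarly for the others.
    lam = np.sqrt(np.array([s12 * s13 / s23,
                            s12 * s23 / s13,
                            s13 * s23 / s12]))
    variances = np.diag(cov)
    theta = variances - lam**2           # error variance per platform
    reliability = lam**2 / variances     # share of variance due to eta
    weights = lam / theta                # proportional to Bartlett score weights
    weights = weights / weights.sum()    # rescaled to sum to 1 for readability
    return lam, theta, reliability, weights

# Hypothetical loadings and error variances for three platforms
# (say Sanger sequencing, pyrosequencing, and quantitative PCR).
lam_true = np.array([0.9, 0.7, 0.5])
theta_true = np.array([0.19, 0.51, 0.75])
cov = np.outer(lam_true, lam_true) + np.diag(theta_true)

lam, theta, rel, w = one_factor_from_three(cov)
print("loadings:     ", lam)
print("reliabilities:", rel)
print("weights:      ", w)
```

In this sketch the platform with the highest reliability receives the largest relative weight. A full latent variable SEM would typically be fitted by maximum likelihood in dedicated software and could include covariates and group effects.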
The third and final contribution of this thesis is the establishment of two bioinformatics pipelines, within a systems biology framework, that integrate incremental biological knowledge obtained from the analysis of newly available experimental data into an existing biological knowledge database, and thereby evolve that knowledge base. Two examples, one from the molecular study of human inflammatory bowel diseases and one from the study of endophytic bacteria known to affect the growth rate of certain plants, illustrate these novel pipelines.