Hardy-Weinberg Deviation and EM-based Haplotype Frequency Estimation

Ahn, Hyeong Jun
The Graduate School, Stony Brook University: Stony Brook, NY.
Single-nucleotide polymorphisms (SNPs) are the most common type of genetic variation in the human genome. Haplotypes, which combine multiple SNPs into super-alleles, have been widely used in modern genetic analysis, especially in human disease association studies. The Expectation Maximization (EM) algorithm is commonly used in haplotype phasing and frequency estimation, and Hardy-Weinberg (HW) equilibrium is a key assumption built into the EM algorithm. Several studies have explored the accuracy of EM-based haplotype frequency estimation when the HW equilibrium assumption is violated. The general consensus is that sampling error plays a more dominant role in haplotype estimation than the estimation error due to HW deviation, and that the accuracy of haplotype frequency estimation tends to improve with increasing homozygosity in the sample. However, these studies mainly concentrated on the impact of SNP level HW deviation. A theoretical foundation for the impact of HW deviation at the haplotype level on haplotype frequency estimation has not been established. In this dissertation, we derived the theoretical relationship among three haplotype mean squared errors: between population and sample frequencies (MSEPS), between true sample and sample estimated frequencies (MSESE), and between population and sample estimated frequencies (MSEPE). The theoretical relationship between SNP level and haplotype level HW deviations was also established. Our simulations show that the violation of HW equilibrium at the haplotype level could result in more severe haplotype estimation error than sampling error, and that the accuracy of haplotype frequency estimation does not always improve with increasing homozygosity.
To incorporate possible haplotype level HW deviations into the haplotype frequency estimation process, we propose a Hardy-Weinberg Deviation-Expectation/Conditional Maximization (HWD-ECM) method, which allows us to estimate HW deviation parameters and haplotype frequencies simultaneously. For the two-SNP case, the HWD-ECM algorithm consists of three iteration steps: (1) an expectation step that estimates genotype frequencies while allowing for HW deviation parameters; (2) a conditional maximization step for HW deviation parameter estimation that uses constraints on SNP level or haplotype level HW deviation parameters; and (3) a conditional maximization step for haplotype frequencies. Simulation results show that the HWD-ECM method performs significantly better than the EM-based approach in haplotype estimation when the HWE assumption is violated. An algorithm for extending HWD-ECM to multiple SNPs is also discussed.
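For orientation, the baseline that HWD-ECM is compared against can be sketched as the standard EM estimator for two biallelic SNPs under the HW equilibrium assumption. This is a minimal illustration, not the dissertation's code and not the HWD-ECM method itself: the function name `em_two_snp`, the 0/1/2 genotype coding, the uniform starting values, and the stopping rule are all assumptions made here. Only the double heterozygote is phase-ambiguous, so the expectation step splits those individuals between the 00/11 and 01/10 haplotype pairs in proportion to the product of the current haplotype frequencies, which is exactly where the HW assumption enters.

```python
from collections import Counter


def em_two_snp(genotypes, n_iter=200, tol=1e-10):
    """Standard EM haplotype frequency estimate for two biallelic SNPs.

    genotypes: list of (g1, g2) pairs, each entry 0, 1, or 2, counting
    copies of allele "1" at that SNP (an assumed coding).
    Returns frequencies [f00, f01, f10, f11], index = 2 * allele1 + allele2.
    Assumes Hardy-Weinberg equilibrium at the haplotype level.
    """
    counts = Counter(genotypes)
    n_people = sum(counts.values())
    n_dh = counts.pop((1, 1), 0)  # double heterozygotes: phase unknown

    # Haplotype counts from genotypes whose phase is unambiguous.
    base = [0.0, 0.0, 0.0, 0.0]
    alleles = {0: (0, 0), 1: (0, 1), 2: (1, 1)}
    for (g1, g2), n in counts.items():
        for a1, a2 in zip(alleles[g1], alleles[g2]):
            base[2 * a1 + a2] += n

    total = 2.0 * n_people  # number of chromosomes
    freq = [0.25, 0.25, 0.25, 0.25]
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is 00/11
        # rather than 01/10, using HW genotype probabilities.
        cis = freq[0] * freq[3]
        trans = freq[1] * freq[2]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5

        # M-step: expected haplotype counts divided by chromosome count.
        new = [
            (base[0] + n_dh * w) / total,
            (base[1] + n_dh * (1 - w)) / total,
            (base[2] + n_dh * (1 - w)) / total,
            (base[3] + n_dh * w) / total,
        ]
        converged = max(abs(a - b) for a, b in zip(new, freq)) < tol
        freq = new
        if converged:
            break
    return freq


if __name__ == "__main__":
    sample = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
    print(em_two_snp(sample))
```

When genotype frequencies depart from HW proportions at the haplotype level, the weight `w` in the expectation step no longer reflects the true phase probabilities, which is the situation the HWD-ECM method is designed to address by estimating deviation parameters alongside the frequencies.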