## Testing the properties of selection criteria: an application to copy number polymorphism measurements

##### Abstract

Variation in the human genome takes many forms, including single-nucleotide polymorphisms (SNPs) and copy number polymorphisms (CNPs). CNPs fall into many categories, such as small insertion-deletion polymorphisms, variable numbers of repetitive sequences, and genomic structural alterations. A major question in statistical genetics is how many CNP categories are present in a given dataset. In this study, I compare five information criteria (BIC, AIC, NEC, CLC, and ICL-BIC) to determine whether any one of them is "best" at identifying the correct number of mixture components (that is, the correct number of CNP categories). I consider six design factors: equal or unequal within-component variances, high or low separation, sample size, mixture proportion, multiple random starting values, and transformation, using two known numbers of components (3 and 6). The results indicate that under "ideal" conditions (a small number of components, large separation between components, constant within-component variance, and no subsequent transformation of the mixture data), every criterion performs well. When the data are a monotonic transformation of data from a mixture, the BIC, which is the most commonly used criterion in CNP research, has a low component number accuracy rate. I then consider applying the Box-Cox transformation regardless of whether it is needed. Applying the Box-Cox transformation when it was not needed did not reduce the component number accuracy rate of the CLC, ICL-BIC, or BIC. When the mixture data had been transformed, applying the Box-Cox transformation improved the component number accuracy rate of the BIC. The Box-Cox transformation should therefore be used routinely with the CLC, ICL-BIC, or BIC to estimate the number of components in a CNP mixture analysis.
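As an illustration of the quantities being compared, the sketch below computes the five criteria from a fitted mixture and applies a Box-Cox transformation with a given parameter. The abstract does not state formulas, so the definitions here are an assumption: they follow the forms commonly given in the mixture-model literature, with smaller values preferred for every criterion. The function names `selection_criteria` and `box_cox`, and all argument names, are hypothetical and do not come from the study.

```python
import numpy as np


def selection_criteria(loglik, loglik_one, n_params, tau):
    """Compute five criteria for a fitted K-component mixture (assumed forms).

    loglik     -- maximized log-likelihood of the K-component fit
    loglik_one -- maximized log-likelihood of the one-component fit
    n_params   -- number of free parameters in the K-component fit
    tau        -- (n, K) array of posterior membership probabilities
    """
    tau = np.asarray(tau, dtype=float)
    n = tau.shape[0]
    # Entropy of the soft classification; 0 * log(0) is treated as 0.
    safe = np.where(tau > 0, tau, 1.0)
    entropy = -np.sum(tau * np.log(safe))
    # Log-likelihood gain over the one-component model (used by NEC).
    gain = loglik - loglik_one
    return {
        "AIC": -2 * loglik + 2 * n_params,
        "BIC": -2 * loglik + n_params * np.log(n),
        "CLC": -2 * loglik + 2 * entropy,
        "ICL-BIC": -2 * loglik + 2 * entropy + n_params * np.log(n),
        # NEC is undefined when there is no gain (for example K = 1);
        # this sketch returns NaN there rather than apply a convention.
        "NEC": entropy / gain if gain > 0 else np.nan,
    }


def box_cox(y, lam):
    """Box-Cox transform of positive data y with parameter lam."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y**lam - 1.0) / lam
```

In use, one would fit mixtures over a range of candidate component counts, evaluate `selection_criteria` for each fit, and select the count that minimizes the chosen criterion, optionally after transforming the data with `box_cox`. How the Box-Cox parameter is chosen is not specified in the abstract and is left open here.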