Abstract:
Whole-genome sequencing and whole-exome sequencing are emerging techniques for exploring the associations between rare variants and complex diseases. The mean and variance of the number of variants expected to appear in one randomly selected group but not in a second group randomly selected from the same population have not previously been established. Expressions for these quantities are derived here. Numerical values are calculated assuming that the frequency of a rare variant follows a beta distribution, using parameters estimated for four populations. Extensions are given for the number of variants that appear in r (r > 1) members of a randomly selected group and in no member of the comparison group. These calculations suggest that a genome-wide study of rare variants would generate an extremely large number of false positives. An exome-wide search would generate a smaller but still overwhelming number of false positives. A search restricted to variants in a specified gene would not generate excessive numbers of false positives. The expectations under the beta model fit a SNP database well when the underlying beta distribution was restricted to variant frequencies greater than 0.001.
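
The kind of calculation described above can be sketched as follows. This is an illustrative sketch, not the paper's own derivation. It assumes that each group is treated as a fixed number of independently sampled chromosomes, that the variant frequency p follows a Beta(a, b) distribution, and that "appear in r members" means exactly r copies in the first group. All function names and parameters are hypothetical. The sketch uses the standard identity E[p^r (1 - p)^k] = B(a + r, b + k) / B(a, b).

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def prob_absent(a, b, k):
    """E[(1 - p)^k] for p ~ Beta(a, b): variant absent from k chromosomes."""
    return exp(log_beta(a, b + k) - log_beta(a, b))


def prob_private(a, b, n_first, n_second):
    """P(variant seen at least once in n_first chromosomes, never in n_second).

    Equals E[(1 - (1 - p)^n_first) * (1 - p)^n_second].
    """
    return prob_absent(a, b, n_second) - prob_absent(a, b, n_first + n_second)


def prob_exactly_r(a, b, n_first, n_second, r):
    """P(exactly r copies among n_first chromosomes, none among n_second)."""
    log_moment = log_beta(a + r, b + n_first - r + n_second) - log_beta(a, b)
    return comb(n_first, r) * exp(log_moment)


def expected_private(num_sites, a, b, n_first, n_second):
    """Expected count of such variants over num_sites variant sites
    (assumption: every site shares the same Beta(a, b) frequency law)."""
    return num_sites * prob_private(a, b, n_first, n_second)
```

Summing `prob_exactly_r` over r = 1, ..., `n_first` recovers `prob_private`, which gives a quick consistency check. The beta parameters and the number of sites must come from fitted estimates such as those the paper reports; none are supplied here.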