Optimal sequencing strategies for surveying molecular genetic diversity.
Pluzhnikov A., Donnelly P.
Two commonly used measures of genetic diversity for intraspecies DNA sequence data are based, respectively, on the number of segregating sites, and on the average number of pairwise nucleotide differences. Expressions are derived for their variance in the presence of intragenic recombination for a panmictic population of fixed size that is at neutral equilibrium at the region sequenced. We show that, in contrast to the slow decrease in variance with increasing sample size, if the recombination rate is nonzero, the asymptotic rate of decrease of variance with increasing sequence length, for fixed sample size, is quite rapid. In particular, it is close to that which would be obtained by sequencing independent chromosome regions. The correlation between measures of diversity from linked regions is also examined. For a given total number of bases sequenced in a particular region, optimal sequencing strategies are derived. These typically involve sequencing relatively few (three to 10) long copies of the region. Under optimal strategies, the variances of the two measures are very similar for most parameter values considered. Results concerning optimal sequencing strategies will be sensitive to gross departures from the underlying assumptions, such as population bottlenecks, selective sweeps, and substantial population substructure.