# Gene Genealogy and Properties of Test Statistics of Neutrality Under Population Growth

^{*}Department of Biology, Graduate School of Sciences, Kyushu University, Fukuoka 810-8560, Japan^{†}Department of Biology, Faculty of Sciences, Kyushu University, Fukuoka 810-8560, Japan

- 1
*Corresponding author:*Department of Biology, Faculty of Sciences, Kyushu University, 4-2-1 Ropponmatsu, Chuo-ku, Fukuoka, 810-8560, Japan. E-mail: htachscb{at}mbox.nc.kyushu-u.ac.jp

## Abstract

We consider the Wright-Fisher model with exponential population growth and investigate effects of population growth on the shape of genealogy and the distributions of several test statistics of neutrality. In the limiting case as the population grows rapidly, the rapid-growth-limit genealogy is characterized. We obtained approximate expressions for expectations and variances of test statistics in the rapid-growth-limit genealogy and star genealogy. The distributions in the star genealogy are narrower than those in the cases of the simulated and rapid-growth-limit genealogies. The expectations and variances of the test statistics are monotone decreasing functions of the time length of the expansion, and the higher power of *R*_{2} against population growth is suggested to be due to their smaller variances rather than to change of the expectations. We also investigated by simulation how quickly the distributions of test statistics approach those of the rapid-growth-limit genealogy.

THE coalescent theory is one of the most powerful tools for analyzing DNA sequence data (Kingman 1982; Hudson 1990; Donnelly and Tavaré 1995). One of the most important issues of this development is to relate the genetic diversity with the population genetic processes. For this purpose, some test statistics of neutrality that detect deviations of DNA polymorphisms from neutral expectations have been developed under the assumptions of constant population size, infinite sites, nonoverlapping generations, random mating, and no recombination.

However, either one or more assumptions do not hold in most organismic populations. In recent years, properties of those test statistics with some alternative assumptions have been investigated by a number of authors. For example, Simonsen* et al.* (1995) studied power properties of three test statistics, Tajima's *D* (1989) and *D** and *F** proposed by Fu and Li (1993), and proposed a new method of constructing critical values of these test statistics. They conducted simulations under several alternatives, that is, a selective sweep event, population bottleneck, and population subdivision, and then they concluded that the most powerful test against the specific alternative hypotheses is Tajima's *D*-statistic. Also, Ramos-Onsins and Rozas (2002) classified some of those tests into three categories (those based on the distribution of the mutation frequencies, on the haplotype, and on the mismatch distribution) and studied the statistical powers of those tests for two population-growth models (sudden and logistic population-growth models). In addition, they newly defined a statistic *R*_{2} on the basis of the difference between the number of singleton mutations, the average number of nucleotide differences, and the number of segregating sites and concluded that *R*_{2} was the most powerful statistic. The effect of a recent spatial expansion on the pattern of molecular diversity within a deme was studied by Ray* et al.* (2003). They found that both the age of the expansion and the rate of migration between neighboring demes affect the shape of gene genealogies. However, all those studies on the test statistics are based only on simulations.

Here, we investigate effects of the past growth of population size on statistical properties of test statistics of neutrality. One of the reasons why we concentrate on the change of population size is that some data show this violation of the assumptions. For example, Stephens* et al.* (2001) analyzed 3899 single-nucleotide polymorphisms within 313 genes from 82 unrelated human individuals of diverse ancestry. Of the 313 genes analyzed in their study, 281 showed a negative Tajima's *D*-value. They interpreted this as strong evidence for a recent expansion of the human population. Also, Purugganan and Suddith (1999) analyzed intraspecific sequence variation at the *Apetala3* and *Pistillata* genes of *Arabidopsis thaliana* and a recent rapid expansion was suggested by the result of their analysis. Expansion of the *A. thaliana* population was also suggested by Innan and Stephan (2000).

So far, properties of the test statistics have been investigated in sudden bottleneck, sudden growth, and logistic growth models by computer simulation (Simonsen* et al*. 1995; Ramos-Onsins and Rozas 2002). In this article, we mainly analyze an exponential growth model but give some analytical treatments of the problem. We obtain the limiting density of the coalescence time and shape of the genealogy when the population grows rapidly. In this limiting case, approximate expectations and variances of some test statistics, *D*_{T} (Tajima 1989), *D*_{FL}, *D**, *F*, *F** (Fu and Li 1993), and *R*_{2} (Ramos-Onsins and Rozas 2002), can be calculated for a specified mutation rate μ (the definitions of those statistics are given in the appendix). Those statistics belong to class I statistics that use information from the mutation frequency and two other classes of statistics are listed in Ramos-Onsins and Rozas (2002). Because the class I statistics are more powerful than the other class statistics for detecting population growth, we investigated only the class I statistics here.

To see how quickly those statistics converge to the limiting values, genealogies are generated by computer simulation on the basis of the coalescent (Hudson 1990), assuming finite growth rates of the population. Also the “rapid-growth-limit genealogy” is generated and means and variances of the statistics under these two types of genealogies with given μ are compared to check the approximations. Since Marjoram and Donnelly (1994) stated that exponential population growth makes the genealogy be star-like (see also Rosenberg and Hirsh 2003), we also calculate the expectations and variances of the statistics for the “star genealogy” and compare statistical properties of the test statistics under these genealogies.

## THEORY AND METHODS

### Coalescent process and growth of population size:

We consider a haploid population that evolves according to the Wright-Fisher model with increasing population size. Suppose that there were *N _{t}* individuals

*t*generations ago. For any fixed time point

*r*, define the relative size function ν

_{Nr}by where [

*x*] denotes the greatest integer less than or equal to

*x*. We define the coalescence time

*T*as the time interval leading from a coalescent with

_{i}*i*alleles to that with

*i*− 1 alleles and

*T*as the time interval from the present up to the most recent common ancestor of the whole sample. Obviously, holds. Griffiths and Tavaré (1994) showed that a diffusion approximation of the Wright-Fisher model exists for the deterministic fluctuation ν

_{Nr}, assuming that ν

_{Nr}is approximated by the population size function ν with a continuous-time parameter as the limit as

*N*tends to infinity. Here, time is scaled in units of

_{r}*N*generations. They also showed that the joint density of (

_{r}*T*, … ,

_{n}*T*

_{2}) is given by 1where and (for more intuitive derivation, see Slatkin and Hudson 1991). Here, we consider a process under the exponential population-growth scenario; that is, we assume that the population size, originally of size

*N*

_{A}, increases as an exponential function. Namely, the relative size function can be written as where

*t*denotes the time interval of the population growth and

_{e}*t*and

*t*are measured in units of

_{e}*N*

_{A}generations. This model was considered by Weiss and Von Haeseler (1998) and Prichard

*et al.*(1999). By (1), the joint density under the exponential population growth is given as follows: 2As α tends to infinity, we obtain 3If we set

*y*=

_{i}*x*−

_{i}*t*, the limiting joint density of

_{e}*y*as α tends to infinity is the same as the joint density under the constant population size. This limiting joint density indicates that there is no coalescence from the present to time

_{i}*t*, and the coalescent process before the population expansion is the same as the process under the case of constant population size with size

_{e}*N*

_{A}. In other words, the limiting genealogy is represented by adding a branch of length

*t*to each external branch of the genealogy under constant population size (Figure 1A). We call this limiting genealogy the rapid-growth-limit genealogy. We can also calculate the limiting values of the expectation of

_{e}*T*

_{n}_{−}

*(0 ≤*

_{m}*m*<

*n*− 1) as α tends to infinity, 45By (4) and (5), the limiting value of the expectation of

*T*can be computed as 6The second term of the right-hand side of (6) is equal to the expectation of

*T*under constant population size. When the population size is large, it can be thought intuitively that any coalescences are difficult to occur in the expansion period, so the rapid-growth-limit genealogy has no coalescence in that period.

### Averages and variances of the statistics:

We study properties of six test statistics with a given mutation rate under three different genealogies, the genealogy for the exponential population-growth model (the “simulated genealogy”), genealogy constructed from the limiting joint density of coalescence times as α tends to infinity (the rapid-growth-limit genealogy), and the star genealogy. For the rapid-growth-limit genealogy and star genealogy, we can compute approximate averages and variances of the test statistics as shown below.

#### Rapid-growth-limit genealogy:

We divide the rapid-growth-limit genealogy into two parts, a genealogy with constant population size (constant part) and added branches (added part) as illustrated in Figure 1A. Assume that *X _{i}* (1 ≤

*i*≤

*n*) is the number of mutations in the added branch of the

*i*th sample and write . The number of nucleotide differences,

*d*, between the

_{ij}*i*th and

*j*th samples is given by where

*d*

^{}

_{ij}denotes the number of nucleotide differences between the

*i*th and

*j*th samples in the constant part. Then the average number of pairwise differences between sample sequences

*k̂*can be written as where

*k̂*

^{(}

^{c}^{)}denotes the average number of nucleotide differences due to the constant part. The distribution of each

*X*is approximately Poisson with a mean μ

_{i}*t*(Kingman 1982). By letting θ = 2

_{e}*N*

_{A}μ and using

*d*(

_{i}*i*= 1, 2) given in the appendix, the average and variance of

*k̂*are given as 78since the covariance term vanishes. Although the mutation parameter θ is usually defined using the current population size, we here define it using the initial size

*N*

_{A}since the current population size tends to infinity in the rapid-growth limit. The number of segregating sites

*S*in the sample is written as where

*S*

^{(}

^{c}^{)}denotes the number of segregating sites in the constant part. The expectation and variance of

*S*are given by 910Similarly, the covariance between

*k̂*and

*S*is given by 11Therefore, by (7)–(11), we can obtain the expectations of the numerator and the square of the denominator of

*D*

_{T}as 1213An approximate expression for the expectation of

*D*

_{T}is obtained by dividing (12) by the square root of (13). The approximation and the δ-method used later give good estimates if the coefficients of variation of respective variables are much smaller than one. We check the accuracy of the approximation by simulation later. Similarly, we obtain the expectations of the numerators and squares of the denominators of the other test statistics except

*R*

_{2}as 1415161718When we calculate values for

*D*

_{FL},

*D**,

*F*, and

*F**, we put (

*u*,

_{D}*v*), (

_{D}*u*

_{D}_{*},

*v*

_{D}_{*}), (

*u*,

_{F}*v*), and (

_{F}*u*

_{F}_{*},

*v*

_{F}_{*}), respectively, into (

*u*,

*v*) of (18). The expectation of the denominator of

*R*

_{2}is given by (9) and the expectation of the square of the numerator can be calculated by using the equation 19

In those calculations, we used an approximation of a form, If *x*_{1} and *x*_{2} are independent, the left-hand side of the equation equals *E*[*x*_{1}] × *E*[1/*x*_{2}] and this shows the right-hand side of the equation underestimates its left-hand side in absolute values. Although we know *x*_{1} and *x*_{2} are dependent, this argument seems to hold approximately as shown in Table 1. Indeed, absolute values of most estimated values for the rapid-growth-limit genealogy are smaller than the values of simulations.

We can also compute covariances between any pair of random variables *k̂*, *S*, η* _{e}*, η

*, and ξ*

_{s}*; then we are able to obtain approximate estimates of variances of test statistics by the δ-method, 20(Stuart and Ord 1987).*

_{i}#### Star genealogy:

Since all mutations are singletons in the star genealogy (Figure 1B), we have a relation between *k̂* and *S*, and obviously η* _{e}* = η

*=*

_{s}*S*hold. Then the test statistics can be written as a function of

*S*as follows: 2122232425Let

*Z*(

_{i}*i*= 1, 2, … ,

*n*) be the number of mutations on the external branch of the

*i*th sample. We can approximate the distribution of

*Z*as Poisson with mean μ

_{i}*t*(Kingman 1982); then the distribution of is also Poisson with mean

_{e}*n*μ

*t*. Because (21)–(25) contain only one random variable

_{e}*S*and each equation has the same form, 26it is easy to obtain distributions of the test statistics except for

*R*

_{2}. The distribution is given by 27where To compute the statistic

*R*

_{2}, however, we need each value of

*Z*, so it is not possible to express it as a function of

_{i}*S*. The expectation of

*R*

_{2}is represented by 28and the variance can be obtained approximately by the δ-method.

## SIMULATION RESULTS

We examined effects of population growth on the expectations, variances, shapes of distributions, and critical values of the six test statistics for various values of *n*, *t _{e}*, α, and θ = 2

*N*

_{A}μ. We simulated the polymorphism data by using methods based on Hudson (1990). For the rapid-growth limit, we generated its genealogy by adding branches to the constant genealogy and then assigned mutations. Although we investigated the distribution for several parameter sets, typical examples are shown in the figures with fixed parameters

*n*= 20 and θ = 1.

Figure 2 shows the effect of α on the distributions of each test statistic with *t _{e}* = 3. The parameter α represents the magnitude of expansion. It can be observed that the distributions of all statistical tests are shifted toward the negative side, kurtosises of distributions become higher, and shapes of the distributions converge to those of the rapid-growth-limit genealogies with increasing α. We can see that

*R*

_{2}has the narrowest distribution, and the variances converge to those of the rapid-growth-limit genealogies more rapidly in

*D*

_{T}and

*R*

_{2}than in the other test statistics. The distributions under the star genealogy are not shown in this figure since the variances are so small that the distributions are much narrower than those in the case of the simulated and rapid-growth-limit genealogies. In Figure 3, the distribution of

*D*

_{T}when one of the parameters

*n*, θ, or

*t*is changed is illustrated. Generally the same tendency as those illustrated in Figure 2 can be seen in those distributions. Although the convergence to the rapid-growth limit is slightly slower for smaller

_{e}*t*, the distributions converge to the limits (see also Table 1). Small sample size and short time length of the expansion result in higher kurtosises of the distribution, and smaller mutation rates make the distribution more rugged.

_{e}The averages and variances of test statistics under exponential population growth and star genealogy are presented in Table 1. Unless the values of the parameters *t _{e}* and α are small, the values for the rapid-growth-limit genealogy are similar to the values of the simulated genealogy in both average and variance. Moreover, the averages for the star genealogy are similar to those for the simulated genealogy, but their variances are one order less than those of the simulated and rapid-growth-limit genealogies. We also note a difference among test statistics in the amount of the reduction of variances caused by population expansion. The variance of

*R*

_{2}is 0.006 for constant population size and 0.0004 for the rapid-growth-limit genealogy when

*t*= 1. Therefore, ∼93% reduction is recognized. In the other test statistics (

_{e}*D*

_{T},

*D*

_{FL},

*D**,

*F*, and

*F**), amounts of reduction are <75% (74, 45, 59, 50, and 40%, respectively).

We also investigated the power of the test statistics and some of the results are shown in Figure 4. First, data were generated assuming the exponential-growth model and then 95% confidence intervals for given *S* of respective statistics were obtained to compute their powers by computer simulation. We do not recognize much difference among the powers of the test statistics except *R*_{2}. In Ramos-Onsins and Rozas (2002), *R*_{2} was characterized as the most powerful test statistics for sudden and logistic population-growth models, and Figure 4 shows that *R*_{2} is also most powerful for the exponential population-growth model. Our analysis indicates that the more powerful test statistics are characterized by larger reduction of variances in the population-growth models.

In Figures 2 and 3, we examined the convergence of distributions of the statistics to those of the rapid-growth limit increasing α independently. However, sometimes we may have some information about interrelationship among parameters. For example, Pluzhnikov* et al.* (2002) state that high growth rates are compatible with the data of a human population for only small growth periods and vice versa. Although such interdependencies among parameters are not usually simple, here we investigate the convergence of the distributions as α grows, keeping α*t _{e}* constant as a simple example of interdependency. Recall that

*t*is the length of the growth period and α is the rate of increase of population size, so the ratio of the present population size to initial population size is represented by

_{e}*e*

^{αte}; namely, the present population size is represented by

*N*

_{A}

*e*

^{αte}. Figure 5 shows the results with α

*t*= 10. In all statistics, the changes of modes of distributions are largest in the case of α = 0.5,

_{e}*t*= 20. For the distributions to be shifted toward minus, mutations must accumulate after the expansion starts. Since not many mutations accumulate if

_{e}*t*is short, changes of the modes will be small in these cases.

_{e}## DISCUSSION

In this article, we investigated the limiting behavior of test statistics of neutrality and obtained approximate formulas for their moments in the limits. In the limiting case, a population, originally of size *N*_{A}, instantaneously increases to an infinite size at time *t _{e}* ago so that no common ancestry can occur in the growth period. Although such extreme cases may be found rarely in nature, we can gain some insights into the previous simulation results obtained and intuitive arguments given by other authors from these formulas.

Pluzhnikov* et al.* (2002) surveyed 10 unlinked noncoding regions of individuals from three human populations. To analyze these data, averages and variances of a few test statistics were estimated using coalescent simulation. From the simulation, they found that recent population growth shifts the distribution of *D*_{T} and *D** toward smaller values and the variance of *D*_{T} decreases as the time, *t _{e}*, elapsed since the expansion event increases. We can explain these behaviors on the basis of our analytical results. As shown in the previous section, the expectations of test statistics are written as functions of

*t*by using Equations 12–18. From these expressions, we can calculate their derivatives with respect to the variable

_{e}*t*. For example, 29where

_{e}*F*denotes the right-hand side of (13) and is positive. This expression is negative for any

_{n}*n*(≥2). In the case of

*D*

_{FL}, 30and this also takes a negative value. Then we can see that those expectations are monotone decreasing with increasing

*t*. For the other test statistics, we can show the same behavior (equations not shown). Besides, monotone decrease of approximate variances obtained by the δ-method can be shown similarly. Of course, since the monotonicity with regard to

_{e}*t*shown here is for the rapid-growth limit, caution must be taken to extrapolate the results to cases with finite α but we expect the behavior of the limit reflects at least those when α is large.

_{e}Generally, it has been considered that sampled genes will have a star-like genealogy if there was rapid population growth (Kaplan* et al.* 1989). Also Marjoram and Donnelly (1994) pointed out that most coalescences of ancestors occur within a relatively short period under the exponential growth of the population size, and this makes a genealogy star-like. In fact, the star-like genealogy is a special case of the rapid-growth-limit genealogy, so we consider in what situation the star-like genealogy will be obtained. If *t _{e}* satisfies then, from (12) and (13), the expectation of

*D*

_{T}under the case of the rapid-growth-limit genealogy is approximately given by 31Since the expectation of

*S*is equal to

*n*μ

*t*in the rapid-growth-limit genealogy, (31) is approximately equal to the expectation of

_{e}*D*

_{T}under the case of the star genealogy given by (21). Recall that time is measured in units of

*N*

_{A}generations. So 1/

*t*→ 0 amounts to

_{e}*N*→ 0.

_{A}We can show the same convergence for the variance. Arguments go similarly for the other test statistics introduced in theory and methods. So the star genealogy can be considered as “the secondary limiting genealogy” as *N*_{A} tends to infinitely small, so that all common ancestry events occur instantaneously.

In this article, only the case of exponential growth was considered thus far but the method can be extended to the sudden population growth case. Let α* _{s}* be the ratio of the population size after the growth to that before the growth and

*t*be the time since the growth event. Then, the joint density,

_{e}*g*(

*t*, … ,

_{n}*t*

_{2}), is expressed as 32where It is obvious that the limiting joint density in this case equals (3) as α

*tends to infinity, so the limiting cases of sudden growth and exponential growth are the same. However, if α and α*

_{s}*are finite, effects of changing*

_{s}*t*are different in (2) and (32). In both equations, the density takes the last expression (

_{e}*t*<

_{e}*x*) if

_{n}*t*= 0 and has the same form as that of the constant-size case. As

_{e}*t*increases, the density starts to differ from that of the constant case and the power to detect deviations increases. However, note that the density (32) for sudden population growth is not a function of

_{e}*t*when

_{e}*x*

_{2}<

*t*and has the same form as that for constant population size. Therefore, the probability that

_{e}*x*

_{2}<

*t*constantly increases as

_{e}*t*increases and eventually the density converges to that of the constant size. Thus, the power to detect deviations decreases as

_{e}*t*tends to infinity. This is why Simonsen

_{e}*et al.*(1995) and Ramos-Onsins and Rozas (2002) found peaks of the power of test statistics under sudden population growth at intermediate values of

*t*. Under exponential growth, it is difficult to predict the behavior of the power of test statistics from (2). However, the behavior of the power as

_{e}*t*changes for large α is expected to be similar to that for the limiting case. Since both the means and variances of the statistics decrease as

_{e}*t*increases in the rapid-growth-limit genealogy, the power increases as

_{e}*t*increases in the limiting case. Therefore, the effect of

_{e}*t*on the power in exponential growth is considered very different from that in sudden growth. Note that a logistic growth is a more realistic population-growth scenario for many organisms and the degree of shift of distribution should be small under a logistic-growth model.

_{e}## APPENDIX

## Acknowledgments

We thank M. Iizuka for discussion and comments on the manuscript. We also thank two anonymous referees for their helpful comments. This work was partially supported by a grant-in-aid for scientific research on priority areas, “Medical Genome Science” no. 15012239.

## Footnotes

Communicating editor: J. Wakeley

- Received June 24, 2004.
- Accepted November 13, 2004.

- Genetics Society of America