Microsatellite
DNA provides abundant, highly variable information within
species. Its potential for distinguishing not only among populations
but also among individuals has attracted recent theoretical
interest (Smouse and Chevillon 1998, Waser and Strobeck 1998).
Here, we present a C++ computer program named WHICHRUN that uses
multilocus genotypic data to allocate individuals to their most
likely source population.
Requirements
The program runs on
Windows 95, 98 or NT (including Macintosh emulations of these operating
systems) and has no specific hardware requirements.
Input File
WHICHRUN requires
baseline genotype data for all potential source populations as well
as genotype data for candidate individuals for which population
origin is to be determined. Data should be provided in simple ASCII
format as required for GENEPOP
(Raymond and Rousset 1995). WHICHRUN's help file describes how to prepare
input data in detail, and the download includes sample baseline,
unknown and output files.
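For orientation, the general GENEPOP layout is a title line, the locus names, and then one "Pop" line ahead of each population's individuals, each written as an ID, a comma, and one genotype code per locus. The Python sketch below reads that layout. It is an illustration only: the function name parse_genepop and the sample text are hypothetical, and WHICHRUN's help file remains the authority on what the program actually accepts.

```python
def parse_genepop(text):
    """Parse GENEPOP-style text into (title, locus names, populations).

    Each population is a list of (sample_id, genotypes) pairs, where
    genotypes holds one (allele, allele) tuple per locus.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    title = lines[0]
    loci, populations = [], []
    i = 1
    # Locus names run until the first "Pop" line (one per line or comma-separated).
    while i < len(lines) and lines[i].lower() != "pop":
        loci.extend(name.strip() for name in lines[i].split(",") if name.strip())
        i += 1
    for ln in lines[i:]:
        if ln.lower() == "pop":
            populations.append([])  # start a new population block
            continue
        sample_id, _, data = ln.rpartition(",")
        genotypes = []
        for code in data.split():
            half = len(code) // 2  # 2 or 3 digits per allele
            # An allele code of 0 conventionally marks missing data.
            genotypes.append((int(code[:half]), int(code[half:])))
        populations[-1].append((sample_id.strip(), genotypes))
    return title, loci, populations


# Hypothetical two-population, two-locus example.
sample = """Toy baseline file
LocusA
LocusB
Pop
a1 , 0101 0203
a2 , 0102 0303
Pop
b1 , 0202 0404
"""

title, loci, populations = parse_genepop(sample)
print(title, loci, populations)
```

The parsed structure keeps each genotype as a pair of integer allele codes, which is a convenient form for counting allele copies per population.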
Whichrun Bayes (v4.2)
If you have Whichrun
4.0, be sure to download version 4.2, which corrects an error that
affected the jackknife results.
Theory and Program Outline
It is assumed
that each baseline population ($B_1, \ldots, B_k$) has
Hardy-Weinberg-Castle (H-W-C) genotype frequencies and that the genetic
loci employed are independent. The likelihood that an individual
sample ($s_1, \ldots, s_n$) may come from each of the source populations
($B_1, \ldots, B_k$) is presumed to be equal to the H-W-C frequency
of its specific genotype at each locus in each respective source
population. Thus, for homozygotes the likelihood that a sample
($s_1$) is an element ($\in$) of baseline population $B_1$ is $p_1^2$
(the square of its allele frequency $p_1$ in population
$B_1$), or for heterozygotes, the likelihood that $s_2 \in B_1$ is $2 p_1 q_1$
($q_1$ being the frequency of an alternate allele in population
$B_1$), and in general the likelihood that $s_n \in B_k$ is
$p_k^2$ or $2 p_k q_k$.
Likelihood values for each locus are multiplied to give a series
of multi-locus likelihood functions for assignment to each of the
source populations. Alternate hypotheses that individual samples
in question may come from each source population are considered
in three ways:
1. Multi-locus likelihood
functions may be grouped to form ratios considering all possible
pairs of baseline populations under consideration. If the ratio
of the most likely allocation grouped with the second most likely
allocation approaches one, there is ambiguity in the assignment
of the particular sample under study. Conversely, samples for which
this ratio yields a large result in comparison to all other ratios
can be assigned to a single population with more confidence. For
the two populations considered in the ratio, the chance of error
is equal to the inverse of this ratio. Stringency for population
allocation can be applied by defining a selection criterion for
the log10 of this ratio. For example, by selecting only
assignments that have a log of the odds (LOD) ratio of at least
2, all results will have a 1/100 chance of error or less.
2. Multi-locus likelihood
functions may be expressed relative to the maximum likelihood,
following the ratio L(n)/L(max). This yields a series of
ratios between 1 (most likely) and close to 0 (least likely). Analysis
of variance of log-transformed data followed by Tukey’s multiple
comparison test enables evaluation of statistical significance in the
classical sense.
3. Jackknife iterations
provide an empirical means for evaluating baseline data and the
chances of correct allocation. Iterations sample individuals from
the baseline one at a time, recalculating allele frequencies in
the absence of each individual genotype sampled before determining
most likely population origin for that individual. Experimenting
with alternate loci and populations enables one to determine which
population comparisons and loci combinations enable reliable population
re-allocation.
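The calculations outlined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not WHICHRUN's own code: the function names (allele_counts, log10_likelihood, assign, jackknife) and the toy genotypes are hypothetical, likelihoods are summed as log10 values rather than multiplied to avoid underflow, and alleles missing from a baseline are given the 1/(2N + 1) frequency described under reporting options. The analysis of variance and Tukey comparison are not shown.

```python
import math
from collections import Counter


def allele_counts(population, locus):
    """Count allele copies at one locus across a list of individuals."""
    counts = Counter()
    for individual in population:
        counts.update(individual[locus])
    return counts


def log10_likelihood(individual, population):
    """Multilocus log10 likelihood under H-W-C proportions, loci independent."""
    n = len(population)
    total = 0.0
    for locus, (a, b) in enumerate(individual):
        counts = allele_counts(population, locus)
        unseen = 1.0 / (2 * n + 1)  # frequency used for unobserved alleles
        p = counts[a] / (2 * n) if counts[a] else unseen
        q = counts[b] / (2 * n) if counts[b] else unseen
        # p^2 for homozygotes, 2pq for heterozygotes
        total += math.log10(p * p if a == b else 2 * p * q)
    return total


def assign(individual, baselines):
    """Return (most likely population, LOD vs runner-up, L(n)/L(max) ratios)."""
    logs = {name: log10_likelihood(individual, pop)
            for name, pop in baselines.items()}
    ranked = sorted(logs, key=logs.get, reverse=True)
    best = ranked[0]
    lod = logs[best] - logs[ranked[1]]  # log10 of the likelihood ratio
    relative = {name: 10 ** (logs[name] - logs[best]) for name in logs}
    return best, lod, relative


def jackknife(baselines):
    """Leave-one-out: share of baseline individuals re-allocated to their source."""
    result = {}
    for name, pop in baselines.items():
        correct = 0
        for i, individual in enumerate(pop):
            reduced = dict(baselines)
            reduced[name] = pop[:i] + pop[i + 1:]  # frequencies without this individual
            best, _, _ = assign(individual, reduced)
            if best == name:
                correct += 1
        result[name] = correct / len(pop)
    return result


# Hypothetical baselines: two populations, two loci, four individuals each.
baselines = {
    "A": [[(1, 1), (3, 3)], [(1, 1), (3, 4)], [(1, 2), (3, 3)], [(1, 1), (3, 3)]],
    "B": [[(2, 2), (4, 4)], [(2, 2), (3, 4)], [(1, 2), (4, 4)], [(2, 2), (4, 4)]],
}
unknown = [(1, 1), (3, 3)]

best, lod, relative = assign(unknown, baselines)
rates = jackknife(baselines)
print(best, round(lod, 2), relative, rates)
```

In this toy data the unknown has likelihood (7/8)^4 in A and (1/8)^4 in B, a ratio of 2401, so the LOD score is about 3.38 and the assignment to A would pass a LOD threshold of 2.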
Reporting options and special
cases
Sample ID, genotypic
data, and multilocus likelihoods for population allocation can be
displayed for verification. A critical population routine allows
one to select a target population for calculation of LOD scores.
All scores are then calculated with the critical population as the
numerator in the ratio. The special case in which a test sample
carries an allele or pair of alleles not observed in one or all of the
baseline populations is treated as follows. For source populations in
which the allele is not observed, an estimated allele frequency of
1/(2N + 1) is applied, where N is the number of individuals sampled
from that baseline population. This hypothesizes that the non-observance
of the allele in question is due to sampling error and that the allele
would have been observed in the baseline population
if one more allele had been sampled. Note that this estimate
may introduce substantial bias if the baseline sample size (N) is
small, as is true of any allele frequency estimate based on
small N, particularly when dealing with highly polymorphic marker
types. The program issues a warning describing this consideration
when small baseline sample sizes (N < 30) are encountered.
Alternatively, if sampling error is low, an unknown sample allele
not observed in a baseline population may constitute strong evidence
that the sample in question may indeed not originate from the particular
baseline population under consideration. Any alleles for which
the 1/(2N+1) estimation is necessary are noted on the genotype output.
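To make the size of this correction concrete, the short Python sketch below applies the 1/(2N + 1) rule and the N < 30 warning threshold quoted above. The names allele_frequency, needs_warning and SMALL_BASELINE are hypothetical, and N is assumed to be the number of diploid individuals in the baseline sample.

```python
SMALL_BASELINE = 30  # warning threshold quoted in the text


def allele_frequency(count, n_individuals):
    """Frequency of an allele seen `count` times among 2N sampled gene copies.

    An unobserved allele is treated as if it would have appeared had one
    more allele been sampled, giving 1/(2N + 1).
    """
    if count > 0:
        return count / (2 * n_individuals)
    return 1 / (2 * n_individuals + 1)


def needs_warning(n_individuals):
    """True when the baseline sample is small enough to warrant a warning."""
    return n_individuals < SMALL_BASELINE


for n in (10, 30, 100):
    f = allele_frequency(0, n)
    # f * f is the homozygote likelihood for an allele never seen in the baseline
    print(n, f, f * f, needs_warning(n))
```

With N = 10 the substituted frequency is 1/21, so a homozygote for the unobserved allele receives a single-locus likelihood of 1/441, while with N = 100 the frequency falls to 1/201. The assigned value therefore depends strongly on baseline sample size, which is the bias the warning addresses.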
It is obvious
that a technique such as WHICHRUN will only be effective if there
is reasonable reproductive isolation among populations under consideration.
Three other considerations are also important. First, the rate
of accumulation of variance for molecular loci employed should be
closely matched with estimated divergence times among populations
under study. For example, highly polymorphic microsatellites prone
to homoplasy would not be suited for diagnosis among populations
that have been diverged for substantial evolutionary time. However,
highly polymorphic microsatellites are likely one of a few molecular
marker types that have sufficient information to resolve diagnosis
among recently diverged populations such as the global radiation
of Drosophila melanogaster which is estimated to have occurred
within the last 10,000 to 15,000 years (David and Capy 1988; Benassi
and Veuille 1995). Second, the accuracy of determination is crucially
dependent upon the lack of differential sampling error among baseline
allele frequencies. While this problem is partially addressed
through ensuring that sample size is equal for all populations,
highly polymorphic marker types such as microsatellites require
substantial sampling. Third, for population origin diagnoses where
source populations are recently diverged, there will be a number
of loci that have not accumulated differences in the time since
divergence. As a result, simply increasing the number of loci employed
may not necessarily increase the power of diagnosis. For closely
related populations, additional loci that have marked differences
in allele frequency profiles among populations will be necessary
to achieve increased power.
Authors
Michael A. Banks1 and Will Eichert2
1 Hatfield Marine Science Center, Oregon State
University, 2030 S. Marine Science Drive, Newport, OR, 97365, USA
2 Bodega Marine
Laboratory, University of California at Davis,
Bodega Bay, CA 94923 USA
Michael.Banks@oregonstate.edu
wfeichert@ucdavis.edu
Published reference
Banks, M.A. and W. Eichert. 2000. WHICHRUN
(Version 3.2): a computer program for population assignment of individuals
based on multilocus genotype data. Journal of Heredity. 91:87-89.
Note: Copyright
has been awarded to the American Genetics Association.
Thanks
From The Bodega Marine Laboratory,
University of California at Davis, P.O. Box 247, Bodega Bay, USA.
We thank V.K. Rashbrook, H.A. Fitzgerald and J. Olsen, for beta
testing various versions of this program and a number of useful
suggestions and improvements that resulted from our collaboration.
We are also grateful to F.J. Saminiego for discussion on statistical
aspects during the development of WHICHRUN. Research and development
of WHICHRUN was supported by funds obtained from the California
Department of Water Resources and the US Fish and Wildlife Service.
References
Benassi, V.
and Veuille, M. 1995. Comparative population structuring of molecular
and allozyme variation of Drosophila melanogaster Adh between
Europe, West Africa and East Africa. Genetical Research. 65:95-103.
David, J.R.
and Capy, P. 1988. Genetic variation of Drosophila melanogaster
natural populations. Trends in Genetics. 4:106-111.
Raymond, M.
and Rousset, F. 1995. GENEPOP (Version 1.2): Population genetics
software for exact tests and ecumenicism. Journal of Heredity.
86:248-250.
Smouse, P.E.
and Chevillon, C. 1998. Analytical aspects of population-specific
DNA fingerprinting for individuals. Journal of Heredity. 89:143-150.
Waser, P.M.
and Strobeck, C. 1998. Genetic signatures of interpopulation dispersal.
Trends in Ecology and Evolution. 13:43-44.
*Updated
February 17th 2000