Increased information content
from highly polymorphic molecular marker types such as microsatellites
has markedly improved resolving power for discrimination among closely
related populations. This, together with increased automation
of techniques for resolving genetic variation, has produced a wealth
of new information. Individual-based methods for assigning
the most likely population of origin to samples are among the new
statistical techniques emerging to take advantage of this information
(Paetkau et al. 1995; Waser and Strobeck 1998; Banks and Eichert 2000).
WHICHLOCI maximizes population assignment accuracy
through empirical analysis of data drawn from “real” populations.
Trial assignments using data from one locus at a time allow ranking
of loci in terms of their efficiency for correct population assignment
and, conversely, their propensity to cause false assignments.
Subsequent trials with increasing numbers of loci are then invoked
to determine what minimum number of which specific loci are required
to attain defined assignment accuracies set by the program user.
Overall, WHICHLOCI assesses which combination of loci would provide
the greatest statistical power for population assignment.
Requirements
The program runs on
Windows 95, 98, 2000 or NT (including Macintosh emulations of these
operating systems) and has no specific hardware requirements.
Input File
The program requires data
from the populations under consideration, listed either as genotypes
per sample (in the same format used for GENEPOP; Raymond and Rousset
1995, http://www.cefe.cnrs-mop.fr/)
or as allele frequencies per population (in the same format as allele
frequency files created in WHICHRUN; Banks and Eichert 2000).
The program is written to analyze co-dominant as well as haploid
data.
Theory and Program Outline
A resample option allows creation of
test data for all populations under consideration. Computer-generated
random numbers specify sampling from an allele table
created from the frequency data for each population. This table
consists of an array of the alleles observed in each population, with
each allele repeated in proportion to its observed frequency
in that population. The user defines how many samples to generate
in this manner and has the option to vary sample size among populations.
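The resampling step described above can be sketched in Python as follows. This is a minimal illustration and not WHICHLOCI's own code: the function names, the use of allele counts (rather than frequencies) to build the table, the diploid default and the toy data are all assumptions made for the example.

```python
import random

def build_allele_table(allele_counts):
    """Expand {allele: count} into a list that repeats each allele `count` times."""
    table = []
    for allele, count in sorted(allele_counts.items()):
        table.extend([allele] * count)
    return table

def resample_genotypes(allele_table, n_samples, rng, ploidy=2):
    """Draw n_samples genotypes by picking `ploidy` random table entries each."""
    return [
        tuple(sorted(rng.choice(allele_table) for _ in range(ploidy)))
        for _ in range(n_samples)
    ]

# Hypothetical allele counts at a single locus for two populations.
counts = {
    "popA": {"120": 6, "124": 3, "128": 1},
    "popB": {"120": 1, "124": 2, "128": 7},
}
# Sample size may differ among populations, as the text allows.
sample_sizes = {"popA": 5, "popB": 8}

rng = random.Random(42)  # fixed seed so the sketch is repeatable
tables = {pop: build_allele_table(c) for pop, c in counts.items()}
test_data = {
    pop: resample_genotypes(tables[pop], sample_sizes[pop], rng)
    for pop in counts
}
```

Because each allele appears in the table as often as it was observed, a uniform random pick from the table reproduces the population's allele frequencies on average.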
Locus combinations that meet the user-defined accuracy for
population assignment are determined in two steps. First, test data
are repeatedly assigned using the method applied in WHICHRUN (Banks
and Eichert 2000), with data from each locus used separately, and
the number of correct assignments to the appropriate source
population is scored for each locus. This score, divided by the total
possible number of correct assignments, is then used to rank loci.
A second round of iterations draws loci from this ranking, increasing
the number of loci one at a time until the assignment score matches
or exceeds the accuracy criterion set by the user. This procedure
addresses accuracy considered across all populations.
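The two-step procedure just described (rank loci one at a time, then add loci in rank order until the target is met) might look like the sketch below. Note the assumptions: the assignment rule here is a simple maximum-likelihood rule under Hardy-Weinberg proportions, used only as a stand-in for the WHICHRUN method; the small frequency floor for unseen alleles, the function names and the toy data are all hypothetical.

```python
import math

def log_likelihood(genotype, freqs, loci, floor=1e-3):
    """Log-likelihood of a diploid genotype, assuming Hardy-Weinberg proportions."""
    total = 0.0
    for locus in loci:
        a, b = genotype[locus]
        p = max(freqs[locus].get(a, 0.0), floor)  # floor avoids log(0)
        q = max(freqs[locus].get(b, 0.0), floor)
        total += math.log(p * p if a == b else 2 * p * q)
    return total

def assign(genotype, baseline, loci):
    """Return the most likely population (ties go to the first name in sort order)."""
    return max(sorted(baseline),
               key=lambda pop: log_likelihood(genotype, baseline[pop], loci))

def accuracy(test_data, baseline, loci):
    """Percent of test samples assigned back to their source population."""
    correct = total = 0
    for source, genotypes in test_data.items():
        for genotype in genotypes:
            total += 1
            correct += assign(genotype, baseline, loci) == source
    return 100.0 * correct / total

def rank_loci(test_data, baseline, loci):
    """Step 1: rank loci by single-locus assignment accuracy, best first."""
    return sorted(loci, key=lambda l: accuracy(test_data, baseline, [l]),
                  reverse=True)

def minimum_loci(test_data, baseline, loci, target):
    """Step 2: add ranked loci one at a time until accuracy reaches the target."""
    ranked = rank_loci(test_data, baseline, loci)
    score = 0.0
    for k in range(1, len(ranked) + 1):
        score = accuracy(test_data, baseline, ranked[:k])
        if score >= target:
            return ranked[:k], score
    return ranked, score  # target not reached even with every locus

# Hypothetical baseline allele frequencies: population -> locus -> allele -> freq.
baseline = {
    "popA": {"loc1": {"a": 0.9, "b": 0.1}, "loc2": {"a": 0.5, "b": 0.5},
             "loc3": {"a": 0.7, "b": 0.3}},
    "popB": {"loc1": {"a": 0.2, "b": 0.8}, "loc2": {"a": 0.5, "b": 0.5},
             "loc3": {"a": 0.4, "b": 0.6}},
}
# Hypothetical test genotypes keyed by true source population.
test_data = {
    "popA": [
        {"loc1": ("a", "a"), "loc2": ("a", "b"), "loc3": ("a", "a")},
        {"loc1": ("a", "a"), "loc2": ("a", "a"), "loc3": ("a", "b")},
        {"loc1": ("a", "b"), "loc2": ("b", "b"), "loc3": ("a", "a")},
    ],
    "popB": [
        {"loc1": ("b", "b"), "loc2": ("a", "b"), "loc3": ("b", "b")},
        {"loc1": ("b", "b"), "loc2": ("a", "b"), "loc3": ("a", "a")},
        {"loc1": ("a", "b"), "loc2": ("a", "a"), "loc3": ("b", "b")},
    ],
}

ranking = rank_loci(test_data, baseline, ["loc1", "loc2", "loc3"])
chosen, score = minimum_loci(test_data, baseline, ["loc1", "loc2", "loc3"], 95.0)
```

In this toy data set, loc2 has identical frequencies in both populations and so ranks last, while the two most informative loci together are enough to meet the 95% target.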
An alternative routine, the critical population routine, focuses on
accuracy of assignment to a specific population set by the user. Iterations
using data from each locus separately occur as above, but loci are
scored only according to how many of the trial samples from the
critical population are assigned correctly. The number
of samples that originate from other populations but are
falsely assigned to the critical population is also tallied. Rank
order under the critical population routine is determined by applying
the following formula:
LocusScore = % correctly assigned - (% incorrectly assigned * scoreMultiplier),
where:
% correctly assigned = % of members of the critical population that were correctly assigned
% incorrectly assigned = # from other populations assigned to the critical population / # from other populations
scoreMultiplier = (100 - user-specified accuracy) / user-specified inaccuracy
This allows the user to weight correct
assignments against false assignments according to how important
each is to the application at hand. An allele frequency differential
following methods described in Shriver et al. (1997) can also be
implemented as an alternative means of ranking loci. As above,
a second round of iterations determines empirically how many of
which loci are required to match accuracy criteria.
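As a worked illustration of the LocusScore formula, the sketch below computes the score for one locus from a list of trial assignments. It assumes that "% incorrectly assigned" is expressed as a percentage (the ratio multiplied by 100) so that both terms share the same units; the function name, argument names and tallies are hypothetical.

```python
def locus_score(assignments, critical, user_accuracy, user_inaccuracy):
    """Score one locus under the critical population routine.

    assignments: list of (source_population, assigned_population) pairs.
    """
    from_critical = [assigned for source, assigned in assignments
                     if source == critical]
    from_others = [assigned for source, assigned in assignments
                   if source != critical]
    # Percent of critical-population samples assigned to the critical population.
    pct_correct = 100.0 * sum(a == critical for a in from_critical) / len(from_critical)
    # Percent of other-population samples falsely assigned to the critical population.
    pct_incorrect = 100.0 * sum(a == critical for a in from_others) / len(from_others)
    score_multiplier = (100.0 - user_accuracy) / user_inaccuracy
    return pct_correct - pct_incorrect * score_multiplier

# Hypothetical tallies: 10 samples from the critical population (8 assigned
# correctly) and 20 samples from other populations (2 falsely assigned to it).
assignments = (
    [("critical", "critical")] * 8 + [("critical", "other")] * 2
    + [("other", "critical")] * 2 + [("other", "other")] * 18
)
score = locus_score(assignments, "critical", user_accuracy=90.0, user_inaccuracy=5.0)
# 80% correct, 10% incorrect, multiplier (100 - 90) / 5 = 2, so score = 80 - 20 = 60
```

A larger multiplier penalizes false assignments to the critical population more heavily, which is how the user's settings shift the ranking.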
There has been increasing interest
in the estimation of confidence intervals for assignment results
from individual-based methods. Accuracy of this estimation
is closely linked to the accuracy of allele frequency
information for the populations under consideration and is addressed
by ensuring that sample sizes among baseline populations match
the estimates required to provide accurate allele frequencies
for polymorphic marker types (see Banks et al. 2000). The
issue of confidence interval estimation in the context of population
assignment, however, becomes multidimensional given a comparison
between alternate likelihoods that a sample may come from each of
the populations under study. The critical population method
presented above provides a convenient means of summarizing these
multidimensional likelihoods from the perspective of the critical
population. WHICHLOCI provides a means for creating multiple
trial data sets. Summary statistical parameters such as variance,
standard deviation and standard error across results from each data
set are determined following standard formulae (Sokal and Rohlf 1995).
A sub-routine written in WHICHLOCI allows users to bypass
the loci ranking routine to determine assignment accuracy, variance,
standard deviation and standard error for a user-selected bank of
loci.
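The summary statistics across trial data sets can be sketched as below, using the usual sample formulae (variance with an n - 1 denominator, standard error as standard deviation divided by the square root of n). Whether WHICHLOCI uses exactly these denominators is an assumption, and the accuracy values are made up for illustration.

```python
import math
import statistics

def summarize(accuracies):
    """Summary statistics for assignment accuracy across trial data sets."""
    n = len(accuracies)
    variance = statistics.variance(accuracies)  # sample variance, n - 1 denominator
    std_dev = math.sqrt(variance)
    return {
        "n": n,
        "mean": statistics.mean(accuracies),
        "variance": variance,
        "std_dev": std_dev,
        "std_error": std_dev / math.sqrt(n),
    }

# Hypothetical percent-correct results from five trial data sets.
trial_accuracies = [90.0, 92.0, 94.0, 96.0, 98.0]
summary = summarize(trial_accuracies)
```

The same function applies whether the loci were chosen by the ranking routine or supplied directly as a user-selected set.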
We thus present an empirical method
for determining which specific combination of loci is most likely
to provide a defined level of population assignment power for
individuals, together with statistical bounds on the performance of
any particular group of loci. We believe that this method will allow
researchers to maximize assignment power in focused population
assignment contexts.
Authors
Michael A. Banks (1), Will Eichert (2) and J.B. Olsen (3)
(1) Marine Fisheries Genetics Laboratory, Coastal Oregon Marine Experiment Station, Hatfield Marine
Science Center, Oregon State University, 2030 SE Marine Science
Drive, Newport, Oregon 97365-5229
(2) The Bodega Marine Laboratory, University of California
at Davis, P.O. Box 247, Bodega Bay 94923-0247
(3) US Fish and Wildlife Service, Alaska Region, Conservation
Genetics Laboratory, 1011 East Tudor Road, Anchorage, Alaska 99503
Email: Michael.Banks@oregonstate.edu, WFeichert@ucdavis.edu, Jeffrey_Olsen@fws.gov
Note: This program is under review for
Bioinformatics under the title:
Which Genetic Loci have Greater Population Assignment Power?
Thanks
Research and development of WHICHLOCI
was supported by funds obtained from CALFED and the California Department
of Water Resources.
References
Banks, M.A., Rashbrook, V.K., Calavetta,
M.J., Dean, C.A. and Hedgecock, D. (2000) Analysis of microsatellite
DNA resolves genetic structure and diversity of chinook salmon in
California’s Central Valley. Can. J. Fish. Aquat. Sci. 57:915-927.
Banks, M.A. and Eichert, W. (2000)
WHICHRUN (version 3.2): A computer program for population assignment
of individuals based on multilocus genotype data. J. Hered. 91:87-89.
Paetkau, D., Calvert, W., Stirling,
I. and Strobeck, C. (1995) Microsatellite analysis of population
structure in polar bears. Mol. Ecol. 4:347-354.
Raymond, M. and Rousset, F. (1995) GENEPOP
(Version 1.2): Population genetics software for exact tests and
ecumenicism. J. Hered. 86:248-250.
Shriver, M.D., Smith, M.W., Jin, L.,
Marcini, A., Akey, J.M., Deka, R. and Ferrell, R.E. (1997)
Ethnic-affiliation estimation by use of population-specific DNA
markers. Am. J. Hum. Genet. 60:957-964.
Sokal, R.R. and Rohlf, F.J. (1995) Biometry.
San Francisco: W.H. Freeman.
Waser, P.M. and Strobeck, C. (1998) Genetic
signatures of interpopulation dispersal.
Trends Ecol. Evol. 13:43-44.