Software for estimating confidence in biological identifications made with DNA sequences.
Identification of biological materials by the DNA contained in them is an increasingly common procedure. It is very useful in situations where identifications based on morphology are not practical, such as in identification of material after it has been consumed by a predatory animal (e.g. Jarman et al., 2004); identification of 'ancient' DNA (e.g. Weinstock et al., 2005); or identification of larval stages that defy morphological identification (e.g. Knott et al., 2003). It is also an increasingly popular choice for scientists who do not have taxonomic skills or appropriate keys to identify organisms that have had DNA sequence 'barcodes' placed in databases (Hebert et al., 2002; Tautz et al., 2003). The 'consortium for the barcoding of life' (http://www.barcodinglife.org/) is attempting to obtain diagnostic DNA sequences for all macroscopic life. Another significant DNA-based identification project is the 'DNA surveillance' project initially established to identify the species of whale meat sold in public markets. The software used for this has been made available for identification of biological material from any taxonomic group.
The methods used to identify the taxon that a DNA sequence should be assigned to are not yet standardised. The most common strategy currently used is to apply phylogenetic methods to place query sequences in a clade with sequences from known taxa. DNAID provides an alternative to phylogenetic methods for identification of DNA sequences. DNAID ('DeoxyriboNucleic Acid IDentification') implements one of the simplest approaches to determining the affinity of a DNA sequence. It uses 'phenetic' comparisons of DNA sequences, rather than the 'cladistic' ones used by most modern phylogenetic inference software. Phenetic comparisons between DNA sequences can be considered appropriate for species identification because it is not necessary to know the evolutionary history of the DNA sequences being compared. For identification purposes, we are only interested in how similar DNA sequences are today, not how they have evolved in the past.
There are several advantages to using this simpler approach. One is that it is far less computationally demanding to determine similarity between DNA sequences without having to estimate how they might be organised into clades. In phylogenetic terms, DNAID estimates a number of two-taxon trees and ranks them to find the shortest tree. In pure phylogenetic approaches, a tree with N taxa is estimated, which means that the relationships between N terminal nodes and N-1 internal nodes must also be estimated. This quickly becomes time consuming when analysing datasets containing large numbers of sequences. Even highly efficient algorithmic methods such as Neighbor Joining as implemented in Phylip require a time proportional to the cube of the number of taxa to construct the tree. Removing the tree building step dramatically reduces the computing time needed to look for DNA sequence matches in large datasets, while still telling us what we need to know for taxonomic assignment.
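The idea of ranking two-taxon comparisons instead of building a tree can be sketched in a few lines of Python. This is a minimal illustration, not DNAID's own code: the function names, the sequence names and the choice to skip gapped sites are assumptions made for the example. Each reference is compared with the query exactly once, so the work grows linearly with the number of reference sequences.

```python
def percent_difference(a, b):
    """Percentage of differing sites between two aligned sequences.

    Sites with a gap ('-') in either sequence are skipped (an assumption).
    """
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return 100.0 * sum(x != y for x, y in pairs) / len(pairs)


def rank_references(query, references):
    """Return (name, distance) pairs sorted from closest to most distant."""
    scored = [(name, percent_difference(query, seq))
              for name, seq in references.items()]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical aligned reference sequences of known origin.
references = {
    "species_A": "ACGTACGTAC",
    "species_B": "ACGTTCGTAC",
    "species_C": "TCGTTCGAAC",
}
ranking = rank_references("ACGTACGTAC", references)
for name, distance in ranking:
    print(name, distance)
```

No internal nodes are estimated at any point; the output is simply the list of references ordered by similarity to the query.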
To use DNAID, the user provides a set of DNA sequences from a range of samples of known origin, along with a DNA 'query' sequence from the orthologous DNA region derived from the biological material to be identified. These sequences can be aligned already, or if there is a version of ClustalW or ClustalX (Thompson et al., 1997) installed on the same computer, DNAID can use this to align sequences from within its own interface. DNAID allows individual sequences to be manipulated by reversing, complementing or reverse complementing them.
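The three sequence manipulations mentioned above are simple string operations. The following is a minimal sketch in modern Python, not code taken from DNAID; the function names are chosen for illustration, and only the four unambiguous nucleotides are complemented, so gaps and ambiguity codes pass through unchanged.

```python
# Translation table for the four unambiguous nucleotides (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    """Return the sequence read in the opposite direction."""
    return seq[::-1]


def complement(seq):
    """Return the complementary strand, read in the same direction."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return complement(seq)[::-1]


print(reverse("AAGC-T"))
print(complement("AAGC-T"))
print(reverse_complement("AAGC-T"))
```

Reverse complementing is the operation usually needed when a query was sequenced from the opposite strand to the reference sequences.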
The user must also provide a taxonomy file, which describes the hierarchical taxonomic relationships between the taxa to which the query sequence is being matched. The user then assigns DNA sequences from known sources to taxa and assigns one of the sequences of unknown origin as a 'query' sequence. Only sequences that have been assigned to taxonomic categories are analysed. DNAID can import aligned DNA sequences produced by ClustalW or ClustalX, or unaligned sequences in Fasta format. Alignments that have been altered in DNAID can be saved in Fasta, MEGA (Kumar et al., 2001), Clustal (.aln), PAUP (Swofford, 2003) or Phylip (Felsenstein, 1993) formats.
Bootstrapped Genetic Distance
Genetic distance between a 'query' sequence and a range of sequences provided by the user can be determined. Currently implemented models of genetic distance are, in increasing level of complexity: percentage difference; Jukes and Cantor's (1969) model; Kimura's (1980) 2-Parameter model; Felsenstein's (1981) model; and Tamura and Nei's (1993) model. The DNA distances between the sequences are also estimated on pseudoreplicate datasets produced by nonparametric bootstrapping (Efron and Tibshirani, 1993) in order to estimate how informative the DNA region being used is for taxonomic identification in that group. Results are given as the proportion of bootstrap replicates in which each of the sequences compared with the query sequence is the closest match to it. Results are also returned in terms of how often members of each taxon at each taxonomic level are the closest match to the query sequence.
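As an illustration of the three simplest distance measures listed above, the following sketch uses the standard textbook formulas for the proportion of differing sites, the Jukes-Cantor correction and the Kimura 2-parameter correction. It is not DNAID's implementation: the function names are hypothetical, sites with gaps or ambiguity codes are simply ignored, and the Felsenstein (1981) and Tamura-Nei models are left out because they also need base frequencies.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def site_proportions(a, b):
    """Return (P, Q): proportions of transitions and transversions."""
    n = ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguity codes (an assumption)
        n += 1
        if x == y:
            continue
        if (x, y) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    if n == 0:
        raise ValueError("no comparable sites")
    return ts / n, tv / n


def p_distance(a, b):
    """Uncorrected proportion of differing sites."""
    P, Q = site_proportions(a, b)
    return P + Q


def jukes_cantor(a, b):
    """d = -(3/4) ln(1 - 4p/3); infinite when the sequences are saturated."""
    arg = 1.0 - 4.0 * p_distance(a, b) / 3.0
    return math.inf if arg <= 0 else -0.75 * math.log(arg)


def kimura_2p(a, b):
    """d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q)."""
    P, Q = site_proportions(a, b)
    arg1, arg2 = 1.0 - 2.0 * P - Q, 1.0 - 2.0 * Q
    if arg1 <= 0 or arg2 <= 0:
        return math.inf
    return -0.5 * math.log(arg1) - 0.25 * math.log(arg2)


seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "GCGTACGTACGTACGTACGA"  # one transition and one transversion
print(p_distance(seq_a, seq_b), jukes_cantor(seq_a, seq_b), kimura_2p(seq_a, seq_b))
```

Both corrected distances are slightly larger than the raw proportion because they allow for multiple substitutions at the same site.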
Bootstrapping the genetic distance calculations has two desirable properties as a means of estimating confidence in taxonomic identifications. Firstly, in situations where a query sequence doesn't perfectly match known DNA sequences and there is overlap of intraspecific variation in sequence between closely related species, lowered bootstrap values will result because a query that closely matches DNA from multiple taxa will most closely match a different one in different bootstrap pseudoreplicate datasets. Secondly, a query that perfectly matches DNA from more than one taxon will produce lowered bootstrap values as an equally close match is 'split' between the taxa.
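The bootstrap procedure can be sketched as follows. This is an illustrative reading of the description above rather than DNAID's own code: alignment columns are resampled with replacement, the closest reference is found by counting mismatches in each pseudoreplicate, and a tie is shared equally among the tied references. The equal sharing of ties, the function names and the example sequences are all assumptions.

```python
import random
from collections import Counter


def bootstrap_support(query, references, replicates=1000, seed=0):
    """Proportion of replicates in which each reference is closest to the query.

    Ties are shared equally among the tied references (an assumption).
    """
    rng = random.Random(seed)
    length = len(query)
    credit = Counter()
    for _ in range(replicates):
        # Build one pseudoreplicate by resampling columns with replacement.
        columns = [rng.randrange(length) for _ in range(length)]
        mismatches = {
            name: sum(query[i] != seq[i] for i in columns)
            for name, seq in references.items()
        }
        best = min(mismatches.values())
        winners = [name for name, m in mismatches.items() if m == best]
        for name in winners:
            credit[name] += 1.0 / len(winners)
    return {name: credit[name] / replicates for name in references}


def support_by_taxon(support, taxon_of):
    """Sum the support of all sequences assigned to the same taxon."""
    totals = Counter()
    for name, value in support.items():
        totals[taxon_of[name]] += value
    return dict(totals)


query = "ACGTACGTACGT"
references = {
    "seq1": "ACGTACGTACGT",  # identical to the query
    "seq2": "ACGTACGTACGT",  # also identical, but from another taxon
    "seq3": "TGCATGCATGCA",  # differs at every site
}
taxon_of = {"seq1": "species_A", "seq2": "species_B", "seq3": "species_C"}

support = bootstrap_support(query, references)
print(support)
print(support_by_taxon(support, taxon_of))
```

In this example the query matches two taxa perfectly, so neither can receive more than half of the support, which is the 'split' behaviour described in the second point above.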
Identification of Diagnostic Sequence Motifs
An appealing way to identify taxa by their DNA is to identify sequence motifs that are unique to taxa of interest. This can be done by comparing DNA sequences from a 'target' taxon and DNA sequences from other 'non-target' taxa. Provided that an appropriate range of sequences from non-target taxa have been included in the analysis, the unique combinations of sites in the target taxon can be used as a marker for identifying that taxon.
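A simple way to look for such motifs in an alignment is sketched below. It is not DNAID's algorithm, only an illustration under stated assumptions: all sequences are aligned to the same length, positions are zero-based, and the function names and example sequences are hypothetical. The first function finds single sites that are fixed in the target taxon and absent from every non-target sequence; the second checks whether a chosen combination of sites is unique to the target.

```python
def diagnostic_sites(target_seqs, nontarget_seqs):
    """Map column index to the nucleotide that is fixed in the target
    taxon and never seen at that column in any non-target sequence."""
    sites = {}
    for i in range(len(target_seqs[0])):
        target_states = {s[i] for s in target_seqs}
        if len(target_states) != 1:
            continue  # the site varies within the target taxon
        state = next(iter(target_states))
        if state in "ACGT" and all(s[i] != state for s in nontarget_seqs):
            sites[i] = state
    return sites


def motif_is_diagnostic(positions, target_seqs, nontarget_seqs):
    """True if every target sequence carries the same combination of
    nucleotides at `positions` and no non-target sequence carries it."""
    target_motifs = {tuple(s[i] for i in positions) for s in target_seqs}
    if len(target_motifs) != 1:
        return False
    motif = next(iter(target_motifs))
    return all(tuple(s[i] for i in positions) != motif for s in nontarget_seqs)


target = ["ACGTAC", "ACGTAC", "ACGTAT"]
nontarget = ["ATGTGC", "ACGAGC", "ATGAGT"]

print(diagnostic_sites(target, nontarget))
print(motif_is_diagnostic((1, 3), target, nontarget))
```

In the example, columns 1 and 3 are not diagnostic on their own, because each target nucleotide also occurs in some non-target sequence, but the combination of the two is found only in the target taxon.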
To determine how confident we can be that a given sequence motif is a good taxonomic identifier, multiple sequences from different individual representatives of the taxon must be included in the analysis. The likelihood that the motif is a unique identifier can then be calculated from the geometric distribution, because the presence or absence of the motif in each member of a target taxon is a Bernoulli trial. The more times we have sampled known members of a taxon and found a characteristic combination of nucleotides (passing the Bernoulli trial), the more confident we can become that future incidences of this motif represent the presence of the target taxon.
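The exact calculation used by DNAID is not given here, so the following is only one plausible reading of the Bernoulli-trial argument. If a fraction p of the target taxon carries the motif, the probability that n independently sampled members all carry it is p to the power n, which is the tail of a geometric distribution. Turning this around gives the smallest motif frequency that is still compatible with n successes in a row at a chosen significance level. The function names and the default level of 0.05 are assumptions.

```python
def prob_all_carry_motif(p, n):
    """Probability that n independent samples all carry the motif
    when a fraction p of the taxon carries it."""
    return p ** n


def min_motif_frequency(n, alpha=0.05):
    """Smallest motif frequency p for which n successes in a row
    still have probability of at least alpha (solves p**n = alpha)."""
    if n < 1:
        raise ValueError("at least one sampled individual is needed")
    return alpha ** (1.0 / n)


for n in (3, 10, 30):
    print(n, round(min_motif_frequency(n), 3))
```

Under these assumptions, finding the motif in 10 out of 10 sampled individuals only supports the claim that at least about 74% of the taxon carries it, while 30 out of 30 raises that bound to about 90%.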
Measurement of Taxonomic Sampling
A very important factor in how confident we can be in a DNA-sequence-based identification is how much information we have about how that DNA sequence varies within and among species. If a dataset has only a low diversity index for sampling of a given taxon, then we cannot be confident that we have found the closest match (even if the match is perfect), since a closer or equally close match might have been found had sampling been more complete. DNAID assesses taxonomic sampling by calculating the Shannon-Weaver diversity index for any parent taxon from the number of times each child taxon is represented in the dataset.
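The Shannon-Weaver index for a parent taxon is H = -sum(p_i * ln p_i), where p_i is the proportion of sequences belonging to child taxon i. A minimal sketch follows; the function name and the example labels are hypothetical, and the use of the natural logarithm is an assumption, since the base is not stated above.

```python
import math
from collections import Counter


def shannon_index(child_labels):
    """Shannon-Weaver diversity index of the child taxa under one parent.

    `child_labels` holds one child-taxon label per sequence in the dataset.
    """
    counts = Counter(child_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Twenty sequences spread evenly over four species of one genus.
even = ["sp1"] * 5 + ["sp2"] * 5 + ["sp3"] * 5 + ["sp4"] * 5
# Twenty sequences, almost all from a single species.
skewed = ["sp1"] * 17 + ["sp2", "sp3", "sp4"]

print(shannon_index(even))
print(shannon_index(skewed))
```

The evenly sampled genus reaches the maximum value for four child taxa, ln 4, while the skewed dataset scores much lower even though it contains the same number of sequences and species.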
Synthesis of Artificial Sequences
DNAID has a facility for synthesising sequences with random variation introduced on the basis of a rate matrix either entered by the user or calculated empirically from a set of 'seed sequences'. Each of the synthesised sequences is based on one of the seed sequences chosen at random and then mutated at random based on the probabilities of each possible type of nucleotide change in the rate matrix.
This option allows the user to explore how different sets of 'fake' DNA sequences might affect identification confidence. Simulation of data that doesn't exist may help to formulate sampling strategies for obtaining enough real data to provide a desired level of confidence.
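The synthesis step described in this section can be sketched as follows. This is not DNAID's implementation: the way the empirical matrix is built (row-normalised counts of nucleotide pairings between all pairs of aligned seed sequences) is an assumption made for illustration, as are the function names and the seed sequences.

```python
import random
from itertools import combinations

BASES = "ACGT"


def empirical_matrix(seeds):
    """Row-normalised counts of base pairings between aligned seeds."""
    counts = {a: {b: 0 for b in BASES} for a in BASES}
    for s, t in combinations(seeds, 2):
        for x, y in zip(s, t):
            if x in BASES and y in BASES:
                counts[x][y] += 1
                counts[y][x] += 1
    matrix = {}
    for a in BASES:
        total = sum(counts[a].values())
        # A base never observed keeps itself with probability 1.
        matrix[a] = {b: (counts[a][b] / total if total else float(a == b))
                     for b in BASES}
    return matrix


def synthesise(seeds, matrix, n, seed=0):
    """Create n sequences, each mutated from a randomly chosen seed."""
    rng = random.Random(seed)
    result = []
    for _ in range(n):
        template = rng.choice(seeds)
        new = []
        for x in template:
            if x in BASES:  # gaps and ambiguity codes are copied unchanged
                row = matrix[x]
                x = rng.choices(BASES, weights=[row[b] for b in BASES])[0]
            new.append(x)
        result.append("".join(new))
    return result


seeds = ["ACGTACGT", "ACGTACGA", "ACGTTCGT"]
matrix = empirical_matrix(seeds)
synthetic = synthesise(seeds, matrix, 5)
print(synthetic)
```

Because the example seeds differ only by changes between A and T, the empirical matrix allows no other kind of change, so C and G sites in the synthetic sequences stay as they are in the seeds.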
A manual for DNAID is available online at DNAID_manual.
DNAID is written in Python 2.3 and utilises the Qt 3.3 graphics library (Trolltech) mediated by the PyQt wrapper (Riverbank Computing). DNAID is primarily developed for Linux platforms, but versions for Mac OS X and Win32 platforms will also be produced in time.
An online manual with detailed instructions for using DNAID will be available here.
The software is released under the GNU General Public License version 2.
Bug reports, suggestions or requests for features can be directed to email@example.com.
The current version of DNAID is DNAID_Linux_a1.py. To run this software under Linux, you must have the following installed:
Python 2.3, which is included with most popular Linux distributions.
Qt 3.3, which is included as part of the KDE desktop. For non-KDE distributions, it can be installed from here.
PyQt, which can be downloaded from here.
DNAID_Linux_a1.py can then be run by:
Extracting the downloaded archive and saving the resulting DNAID_Linux_a1.py file in a handy location
Either opening it from your favourite Python IDE; or opening a terminal and typing:
python /home/simon/DNAID_Linux_a1.py
(or something similar, depending on the location of the file)
The first Mac version of DNAID will be DNAID_Mac_a1.py. To run this software the following must be installed:
Python 2.3. This is pre-installed on Mac OS X 10.3 and above. Users of older versions of Mac OS X should download and install Python 2.3.
PyQt for MacOSX, which can be downloaded and installed from here.
A Win32 version is not currently implemented. A solution will hopefully be provided soon.
DNAID can be downloaded from its sourceforge project page http://sourceforge.net/projects/dnaid.
Efron B, Tibshirani RJ (1993) An introduction to the bootstrap. Chapman and Hall, Boca Raton.
Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17: 368-376.
Felsenstein J (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle.
Hebert PDN, Cywinska A, Ball SL, deWaard JR (2002) Biological identifications through DNA barcodes. Proceedings of the Royal Society of London B 270: 313-321.
Jarman SN, Deagle BE, Gales NJ (2004) Group-specific polymerase chain reaction for DNA-based analysis of species diversity and identity in dietary samples. Molecular Ecology 13: 1313-1322.
Jukes TH, Cantor CR (1969) Evolution of protein molecules. In: Munro HN (ed) Mammalian Protein Metabolism, pages 21-123. Academic Press, New York.
Knott KE, Balser EJ, Jaeckle WB, Wray GA (2003) Identification of asteroid genera with species capable of larval cloning. Biological Bulletin 204: 246-255.
Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16: 111-120.
Kumar S, Tamura K, Jakobsen IB, Nei M (2001) MEGA2: Molecular Evolutionary Genetics Analysis software. Bioinformatics 17: 1244-1245.
Swofford DL (2003) PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts.
Tamura K, Nei M (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution 10: 512-516.
Tautz D, Arctander P, Minelli A, Thomas RH, Vogler AP (2003) A plea for DNA taxonomy. Trends in Ecology and Evolution 18: 70-74.
Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997) The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 25: 4876-4882.
Weinstock J, Willerslev E, Sher A, Tong W, Ho SYW, Rubinstein D, Storer J, Burns J, Martin L, Bravi C, Prieto A, Froese D, Scott E, Xulong L, Cooper A (2005) Evolution, systematics, and phylogeography of Pleistocene horses in the New World: a molecular perspective. PLoS Biology 3: e241.