DISTIQUE is a Python-based tool for inferring species trees from gene trees. The name is a forced acronym for Distance-based Inference of Species Trees using Induced QUartet Elements.
The basic idea behind DISTIQUE is to infer the species tree from gene trees by computing a distance between each pair of species from the set of quartets that include both species. We have found ways to do this in a statistically consistent manner. DISTIQUE introduces a family of distance-based tree inference methods, with running times ranging from quadratic to quartic in the number of leaves.
It first computes a distance matrix based on the quartet trees induced by the gene trees, and then infers the species tree using a distance-based method such as FastME [1] or PhyD* [2].
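To make the idea of turning quartets into pairwise distances concrete, here is a toy sketch in Python. It is not DISTIQUE's actual distance, which is defined in the paper. The function name toy_quartet_distances, the tuple encoding of quartet trees, and the "1 - frequency" distance are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def toy_quartet_distances(quartets):
    """Toy pairwise distances from quartet trees (NOT DISTIQUE's distance).

    Each quartet is encoded as ((a, b), (c, d)), meaning the topology ab|cd.
    The distance between x and y is 1 minus the fraction of quartets
    containing both taxa in which they sit on the same side of the split.
    """
    together = defaultdict(int)  # times a pair appears on the same side
    total = defaultdict(int)     # times a pair appears in the same quartet
    for left, right in quartets:
        taxa = tuple(left) + tuple(right)
        for pair in combinations(sorted(taxa), 2):
            total[pair] += 1
        for side in (left, right):
            together[tuple(sorted(side))] += 1
    return {pair: 1.0 - together[pair] / total[pair] for pair in total}


# Three hypothetical gene trees on the same four taxa.
quartets = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
]
distances = toy_quartet_distances(quartets)
```

A dictionary like this could then be written out as a distance matrix and passed to a distance-based method; the actual matrix format expected by FastME or PhyD* is not shown here.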
You can install DISTIQUE in a couple of steps:
DISTIQUE repository is placed. There are a couple of dependencies that you need to install at this step as well.
DendroPy, version 4 or later. With root access, you can install it using the command sudo pip install "dendropy>=4". If you don't have root access, you can install DendroPy locally using the command pip install "dendropy>=4" --user.
PhyD* is another distance-based method that handles missing values in the distance matrix reasonably well. You can download it from this link, then put it under your WS_HOME. The default distance-based software is FastME.
Quartet code from this thesis. Make sure that you set the QUARTETS compiler flag. After compilation, place the executable quart_bin under DISTIQUE/bin. This code is used for DISTIQUE-all-pairs.
The set of tools at this git repository. They are written in different languages and for different versions of DendroPy. Download or clone this repository and put it under your WS_HOME. Some of the bash scripts in DISTIQUE depend on this set of tools.
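After the steps above, a quick check of the layout can catch a misplaced file early. The sketch below assumes that the DISTIQUE repository sits directly under WS_HOME; the helper name missing_distique_files is hypothetical and not part of DISTIQUE.

```python
import os


def missing_distique_files(ws_home):
    """Return the expected paths that do not exist under ws_home.

    Assumption: the DISTIQUE repository is placed directly under WS_HOME.
    """
    expected = [
        os.path.join(ws_home, "DISTIQUE", "src", "utils"),
        os.path.join(ws_home, "DISTIQUE", "bin", "quart_bin"),
    ]
    return [path for path in expected if not os.path.exists(path)]


if __name__ == "__main__":
    ws_home = os.environ.get("WS_HOME")
    if ws_home is None:
        print("WS_HOME is not set")
    else:
        for path in missing_distique_files(ws_home):
            print("missing:", path)
```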
The main script for running DISTIQUE is available under DISTIQUE/src/utils/distique-2.py
Usage: distique.py [options]
Options:
-h, --help Show this help message and exit
-t STRAT, --strategy=STRAT The version of DISTIQUE to be run: 1 (all-pairs, prod),
2 (all-pairs, max), 3 (Distance-sum), and 4 (Tree-sum).
Default is DISTIQUE Distance-sum (3)
-f FILENAME, --file=FILENAME Read quartet table from FILENAME
-g GT, --gene=GT Read gene trees from GT
-o OUT, --output=OUT The PATH to write the generated files
-y THR, --threshold=THR The minimum frequency that consensus will use. Default is 0.5
-v VERBOSE, --verbose=VERBOSE Verbose
-u SUMPROG The distance based program to find species tree from distance matrix.
The options are ninja, fastme, phydstar. Default is fastme
-z SUMPROGOPTION The distance method to build the tree. If sumProg is set to fastme
the options are TaxAdd_(B)alME (-s) (Default), TaxAdd_(B2)alME (-n),
(D) default of fastme, TaxAdd_(O)LSME (-s), TaxAdd_(O2)LSME (-n),
B(I)ONJ, (N)J. The default in this case is TaxAdd_(B)alME. If
sumProg is set to phydstar, the options are BioNJ, MVR, and NJ.
-s SP, --sp=SP Species tree
-n NUM, --numStep=NUM The #rounds of anchoring, default is 2
-r OUTLIER The strategy for outlier removal. This option is only effective with tree-sum.
The options are pairwise1, pairwise2, consensus10, or consensus3.
Default is consensus3
-x SUMMARY The summary method that will be used to summarize inferred species trees.
Only effective with tree-sum. Default is MRL.
-a AV, --averagemethod=AV The strategy to find the average quartet tables.
Options are geometric mean (gmean), averaging (mean),
or root mean square (otherwise). Default is mean.
-p MET The method to summarize quartet results around each
node, freq, or log, Default is freq
-l FILLMETHOD The method to fill empty cells in distance tables, used with distance-sum
or tree-sum. Options are const, rand, or normConst. Default is const
-m METHOD, --distmethod=METHOD The method to compute distances between pairs of species.
Options are prod (default) or min.
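The three averaging choices listed for the -a option are standard means. The sketch below shows them on a plain list of numbers; how DISTIQUE applies them to quartet tables internally is not shown here, and the function name average_frequencies is hypothetical.

```python
import math


def average_frequencies(values, method="mean"):
    """Average a list of numbers the way the -a option names suggest.

    "gmean" is the geometric mean (all values must be positive),
    "mean" is the arithmetic mean, and any other value falls back to
    the root mean square, mirroring the "otherwise" in the help text.
    """
    values = list(values)
    if method == "gmean":
        return math.exp(sum(math.log(v) for v in values) / len(values))
    if method == "mean":
        return sum(values) / len(values)
    return math.sqrt(sum(v * v for v in values) / len(values))


example = [1.0, 4.0]
arithmetic = average_frequencies(example, "mean")
geometric = average_frequencies(example, "gmean")
root_mean_square = average_frequencies(example, "rms")
```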
We used three main datasets in this paper.
These two datasets were used to evaluate the performance of DISTIQUE on datasets with a relatively small number of species, varying levels of ILS, and varying numbers of genes. For more details about these two datasets, visit link
We analyzed these two datasets with different methods. Inferred species trees are available here. The methods that we used for our analyses are ASTRID (NJst) [3], ASTRAL, DISTIQUE-AAD, DISTIQUE-AMD, DISTIQUE all pairs inferred with different distance methods (i.e. different methods implemented in FastME and PhyD*), DISTIQUE distance sum with 1, 2, 4, and 8 rounds of anchor sampling around each polytomy, and DISTIQUE tree sum with 1, 2, 4, and 8 rounds of anchor sampling around each polytomy.
The RF distances between true and inferred species trees are available at this link for the avian and mammalian datasets, and here for the SimPhy dataset (in csv format). You can find the running times for the SimPhy dataset at the following link.
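For readers unfamiliar with the RF (Robinson-Foulds) distance, the following is a minimal sketch of the unnormalized version for unrooted trees. It counts the non-trivial bipartitions found in exactly one of the two trees. The nested-tuple tree encoding and the function names are assumptions made for illustration, and this is not the code used to produce the published numbers.

```python
def leaves(tree):
    """Return the set of leaf names below a nested-tuple tree node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions (splits) induced by the tree."""
    all_leaves = leaves(tree)
    splits = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        below = leaves(node)
        other = all_leaves - below
        # Keep only splits with at least two taxa on each side.
        if len(below) >= 2 and len(other) >= 2:
            splits.add(frozenset([below, other]))
        stack.extend(node)
    return splits


def rf_distance(tree1, tree2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(bipartitions(tree1) ^ bipartitions(tree2))


# Two hypothetical five-taxon trees that differ in one internal branch.
true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
distance = rf_distance(true_tree, inferred_tree)
```

Both trees must be defined on the same set of taxa for the comparison to be meaningful.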
**Important note:** All of the results published in this paper were produced with this release of DISTIQUE. In this version of DISTIQUE, a specific random seed was not set, and as a result there is some randomness across different runs of DISTIQUE on the exact same dataset. This can translate to small changes in the final results of the DISTIQUE-distance-sum and DISTIQUE-tree-sum algorithms; the differences are typically very small. In future versions of DISTIQUE, we plan to set a random seed so that the results become reproducible.
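As a general illustration of why a fixed seed makes random sampling reproducible, consider the sketch below. The function sample_anchors is hypothetical and does not reproduce DISTIQUE's anchor sampling; it only shows that a seeded random generator returns the same sample on every run.

```python
import random


def sample_anchors(taxa, num_anchors, seed=None):
    """Pick num_anchors taxa at random (hypothetical helper).

    With seed=None the result can change between runs; with a fixed
    seed the same taxa are returned every time.
    """
    rng = random.Random(seed)
    # Sort first so the result does not depend on the input order.
    return rng.sample(sorted(taxa), num_anchors)


taxa = ["A", "B", "C", "D", "E", "F"]
first_run = sample_anchors(taxa, 2, seed=42)
second_run = sample_anchors(taxa, 2, seed=42)
```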
python distique.py -a mean -m prod -t 1 -u fastme -z D -g [GENE TREES] -o [OUTPUT DIR]
python distique.py -a mean -m prod -t 2 -u fastme -z D -g [GENE TREES] -o [OUTPUT DIR]
python distique.py -a mean -m prod -t 3 -u fastme -z B -g [GENE TREES] -o [OUTPUT DIR]
python distique.py -a mean -m prod -t 4 -u fastme -z B -g [GENE TREES] -o [OUTPUT DIR]
* ASTRID
python ASTRID.py -m fastme2 -i [GENE TREES] -o [OUTPUT SPECIES TREE]
* ASTRAL
java -Xmx2000M -jar astral.4.7.8.jar -i [GENE TREES] -o [OUTPUT SPECIES TREE]
We reanalyzed the avian biological dataset of Jarvis et al. [5], which consists of 2022 supergene trees. The resulting species trees are available at biological species trees.
Contact Erfan Sayyari
1. Lefort, Vincent, Richard Desper, and Olivier Gascuel. FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference program. Molecular Biology and Evolution 32.10 (2015): 2798-2800.
2. Criscuolo, Alexis, and Olivier Gascuel. Fast NJ-like algorithms to deal with incomplete distance matrices. BMC Bioinformatics 9.1 (2008): 166.
3. Vachaspati, Pranjal, and Tandy Warnow. ASTRID: accurate species trees from internode distances. BMC Genomics 16.Suppl 10 (2015): S3.
4. Mallo, Diego, Leonardo De Oliveira Martins, and David Posada. SimPhy: Phylogenomic Simulation of Gene, Locus, and Species Trees. Systematic Biology 65.2 (2016): 334-344.
5. Jarvis, Erich D., et al. Phylogenomic analyses data of the avian phylogenomics project. GigaScience 4.1 (2015): 1-9.