DISTIQUE

DISTIQUE is a Python-based tool for inferring species trees from gene trees. The name is a forced acronym for Distance-based Inference of Species Trees using Induced QUartet Elements.

The basic idea behind DISTIQUE is to infer the species tree from gene trees by computing a distance between each pair of species from the set of quartets that include both species. We have found ways to do this in a statistically consistent manner. DISTIQUE introduces a family of distance-based tree inference methods, with running times ranging from quadratic to quartic in the number of leaves.

How DISTIQUE infers species trees

DISTIQUE first computes a distance matrix from the quartet trees induced by the gene trees, and then infers the species tree using a distance-based method such as FastME¹ or PhyD*².
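The first step can be illustrated with a toy sketch in plain Python. This is not DISTIQUE's implementation: the nested-tuple tree representation, the function names, and the distance formula (one minus the mean frequency with which a pair appears together in a quartet) are assumptions chosen only to show how induced quartets can be turned into a pairwise distance table.

```python
from itertools import combinations


def clades(tree):
    """Return (leaf set, list of internal clades) for a nested-tuple rooted tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def quartet_topology(tree_clades, quartet):
    """Return the induced split {{x, y}, {z, w}} of the quartet, or None if unresolved."""
    q = frozenset(quartet)
    for clade in tree_clades:
        side = clade & q
        if len(side) == 2:  # this edge separates two quartet taxa from the other two
            return frozenset([side, q - side])
    return None


def toy_distances(gene_trees, taxa):
    """Illustrative distance (an assumption, not DISTIQUE's formula):
    1 - mean frequency of ab|cd over all other pairs (c, d)."""
    all_clades = [clades(t)[1] for t in gene_trees]
    dist = {}
    for a, b in combinations(sorted(taxa), 2):
        others = [t for t in sorted(taxa) if t not in (a, b)]
        freqs = []
        for c, d in combinations(others, 2):
            target = frozenset([frozenset([a, b]), frozenset([c, d])])
            hits = sum(quartet_topology(cl, (a, b, c, d)) == target for cl in all_clades)
            freqs.append(hits / len(gene_trees))
        dist[(a, b)] = 1.0 - sum(freqs) / len(freqs)
    return dist


# Three toy gene trees on five taxa; two of them agree on the (A, B) clade.
example_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
]
example_dist = toy_distances(example_trees, "ABCDE")
```

In this sketch, pairs that are grouped together in most induced quartets receive small distances, so a distance-based method applied to the table would tend to join them first.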

How to install DISTIQUE

You can install DISTIQUE in a couple of steps:

Clone the DISTIQUE git repository or download this zip file.
Then set the environment variable WS_HOME to the directory under which the DISTIQUE repository is placed. There are also a couple of dependencies that you need to install at this step.

Dependencies

FastME 2.0, available at this link. FastME is a distance-based method that produces species trees from the provided distance matrix. Put it under your WS_HOME.
DendroPy 4.0.3, a Python library for phylogenetic computing. You can install it with the command sudo pip install "dendropy>=4". If you don't have root access, you can install DendroPy locally with the command pip install "dendropy>=4" --user.
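Before running DISTIQUE, it can help to confirm that the setup described above is in place. The following is a minimal sketch of such a check; the function name check_setup is hypothetical and not part of DISTIQUE, and it only verifies that WS_HOME points to a directory and that DendroPy is importable. It does not check for FastME, since the name of its executable is not given here.

```python
import importlib.util
import os


def check_setup(environ=None):
    """Return a list of human-readable problems with the local setup (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    problems = []

    ws_home = environ.get("WS_HOME")
    if not ws_home:
        problems.append("WS_HOME is not set")
    elif not os.path.isdir(ws_home):
        problems.append("WS_HOME does not point to a directory: " + ws_home)

    # find_spec returns None when the package cannot be found on the import path.
    if importlib.util.find_spec("dendropy") is None:
        problems.append("DendroPy is not importable")

    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print("problem:", problem)
```

An empty list means that both checks passed.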

Optional Dependencies

PhyD* is another distance-based method, which handles missing values in the distance matrix reasonably well. You can download it from this link, then put it under your WS_HOME. The default distance-based software is FastME.
Quartet code from this thesis. Make sure that you set the QUARTETS compiler flag. After compilation, place the executable quart_bin under DISTIQUE/bin. This code is used for DISTIQUE-all-pairs.
The set of tools at this git repository. They are written in different languages and use different versions of DendroPy. Download or clone this repository and put it under your WS_HOME. Some of the bash script tools in DISTIQUE depend on this set of tools.

How DISTIQUE works

The main script for running DISTIQUE is available under DISTIQUE/src/utils/distique-2.py

    Usage: distique.py [options]
    Options:
        -h, --help                      Show this help message and exit
        -t STRAT, --strategy=STRAT      The version of DISTIQUE to be run 1 (all-pairs, prod),
                                        2 (all-pairs, max), 3 (Distance-sum), and 4 (Tree-
                                        sum), default is DISTIQUE Distance-SUM (3)
        -f FILENAME, --file=FILENAME    Read quartet table from FILENAME
        -g GT, --gene=GT                Read genetrees from FILENAME
        -o OUT, --output=OUT            The PATH to write the generated files
        -y THR, --threshold=THR         The minimum frequency that consensus will use. Default is 0.5
        -v VERBOSE, --verbose=VERBOSE   Verbose
        -u SUMPROG                      The distance based program to find species tree from distance matrix.
                                        The options are ninja, fastme, phydstar. Default is fastme
        -z SUMPROGOPTION                The distance method to build the tree. If sumProg is set to fastme 
                                        the options are TaxAdd_(B)alME (-s) (Default), TaxAdd_(B2)alME (-n), 
                                        (D) default of fastme, TaxAdd_(O)LSME (-s), TaxAdd_(O2)LSME (-n),
                                        B(I)ONJ, (N)J. The default in this case is TaxAdd_(B)alME. if the  
                                        sumProg is set to phydstar, The options are BioNJ, MVR, and NJ. 
        -s SP, --sp=SP                  Species tree
        -n NUM, --numStep=NUM           The #rounds of anchoring, default is 2
        -r OUTLIER                      The strategy for outlier removal. This option is only effective with tree-sum.
                                        The options are pairwise1, pairwise2, consensus10, or consensus3. 
                                        Default is consensus3
        -x SUMMARY                      The summary method that will be used to summarize inferred species trees. 
                                        Only effective with tree-sum. Default is MRL.
        -a AV, --averagemethod=AV       The strategy to find the average quartet tables.
                                        Options are geometric mean (gmean), averaging (mean),
                                        or root mean square (otherwise). Default is mean.
        -p MET                          The method to summarize quartet results around each
                                        node, freq, or log, Default is freq
        -l FILLMETHOD                   The method to fill empty cells in distance tables, used with distance-sum
                                        or tree-sum. Options are const, rand, or normConst. Default is const
        -m METHOD, --distmethod=METHOD  The method to compute distances between pairs of species. 
                                        Options are prod (default) or min.

Datasets

We mainly used three datasets in this paper.

Avian dataset is a 45-taxon dataset, based on biological data, with a single true species tree. This file contains the model species tree and the true and estimated gene trees.
Mammalian dataset is a 37-taxon dataset, based on biological data, with a single true species tree. This file contains the model species tree and the true and estimated gene trees.

These two datasets were used to evaluate the performance of DISTIQUE on datasets with a relatively small number of species, varying levels of ILS, and varying numbers of genes. For more details about these two datasets, visit this link.

We analyzed these two datasets with different methods. Inferred species trees are available here. The methods that we used for our analyses are ASTRID (NJst)³, ASTRAL, DISTIQUE-AAD, DISTIQUE-AMD, DISTIQUE all pairs inferred with different distance methods (i.e. different methods implemented in FastME, and PhyD*), DISTIQUE distance sum with 1, 2, 4, and 8 rounds of anchor sampling around each polytomy, and DISTIQUE tree sum with 1, 2, 4, and 8 rounds of anchor sampling around each polytomy.

SimPhy dataset was also used for evaluating ASTRAL-II and was simulated using SimPhy⁴. True and estimated gene trees, as well as true species trees and species trees estimated using ASTRAL-II, are available at ASTRAL-II.
For this dataset, we also provide the species trees inferred using DISTIQUE all-pairs (for some model conditions), DISTIQUE distance-sum with 2 and 8 rounds of anchor sampling, and ASTRID.

The RF distances between true and inferred species trees are available at this link for the avian and mammalian datasets, and here for the SimPhy dataset (in csv format). You can find the running times for the SimPhy dataset at the following link.
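For readers unfamiliar with the metric, the sketch below shows one way to compute an RF (Robinson-Foulds) distance between two trees on the same taxa. It is a plain-Python illustration and not the code used to produce the published numbers; the nested-tuple tree representation and the function names are assumptions made for this example.

```python
def leaf_sets(tree):
    """Return (leaf set, set of internal clades) for a nested-tuple rooted tree."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_found = leaf_sets(child)
        leaves |= child_leaves
        found |= child_found
    found.add(leaves)
    return leaves, found


def bipartitions(tree):
    """Return the non-trivial bipartitions of the tree, viewed as unrooted.

    Each bipartition is stored as the side that does not contain a fixed
    reference taxon, so the same split always has the same representation.
    """
    taxa, clade_set = leaf_sets(tree)
    reference = min(taxa)
    splits = set()
    for clade in clade_set:
        side = taxa - clade if reference in clade else clade
        if 1 < len(side) < len(taxa) - 1:  # skip trivial and empty splits
            splits.add(side)
    return splits


def rf_distance(tree1, tree2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(bipartitions(tree1) ^ bipartitions(tree2))


true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
distance = rf_distance(true_tree, inferred_tree)
```

The sketch assumes both trees have exactly the same leaf set. Dividing the result by the total number of non-trivial bipartitions in the two trees gives a normalized value between 0 and 1.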

Commands and Version numbers

Important Note: All of the results published in this paper were produced with this release of DISTIQUE. In this version of DISTIQUE, a specific random seed was not set, and as a result there is some randomness across different runs of DISTIQUE on the exact same dataset. This typically translates to very small changes in the final results of the DISTIQUE-distance-sum and DISTIQUE-tree-sum algorithms. In future versions of DISTIQUE, we plan to specify a random seed so that the results become reproducible.

AAD

python distique.py -a mean -m prod -t 1 -u fastme -z D -g [GENE TREES] -o [OUTPUT DIR]

AMD

python distique.py -a mean -m prod -t 2  -u fastme -z D -g [GENE TREES] -o [OUTPUT DIR]

DISTIQUE all pairs with different summary methods

python distique.py -a mean -m prod -t 2  -u fastme -z D -g [GENE TREES] -o [OUTPUT DIR]

DISTIQUE distance-sum

python distique.py -a mean -m prod -t 3  -u fastme -z B -g [GENE TREES] -o [OUTPUT DIR]

DISTIQUE tree-sum

python distique.py -a mean -m prod -t 4 -u fastme -z B -g [GENE TREES] -o [OUTPUT DIR]
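When running many analyses, the DISTIQUE commands above can be assembled programmatically. The sketch below builds the same argument lists in Python; the function build_distique_command and the PRESETS table are hypothetical helpers, not part of DISTIQUE, and they use only the flags and values shown in the commands above.

```python
# Strategy number (-t) and FastME option (-z) for each variant listed above.
PRESETS = {
    "AAD": {"strategy": 1, "sumprog_option": "D"},
    "AMD": {"strategy": 2, "sumprog_option": "D"},
    "distance-sum": {"strategy": 3, "sumprog_option": "B"},
    "tree-sum": {"strategy": 4, "sumprog_option": "B"},
}


def build_distique_command(variant, gene_trees, out_dir, script="distique.py"):
    """Return the argument list for one DISTIQUE run (hypothetical helper)."""
    preset = PRESETS[variant]
    return [
        "python", script,
        "-a", "mean",
        "-m", "prod",
        "-t", str(preset["strategy"]),
        "-u", "fastme",
        "-z", preset["sumprog_option"],
        "-g", gene_trees,
        "-o", out_dir,
    ]


# Placeholder paths; replace them with real gene tree and output locations.
command = build_distique_command("distance-sum", "genes.tre", "out")
```

The resulting list can be passed to subprocess.run(command, check=True) from the standard library, which avoids shell quoting problems with paths that contain spaces.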

ASTRID

python ASTRID.py -m fastme2 -i [GENE TREES] -o [OUTPUT SPECIES TREE]

ASTRAL

java -Xmx2000M -jar astral.4.7.8.jar -i [GENE TREES] -o [OUTPUT SPECIES TREE]

Biological dataset

We reanalyzed the avian biological dataset of Jarvis et al., which consists of 2022 supergene trees.⁵ The resulting species trees are available at biological species trees.

Report bugs

Contact Erfan Sayyari

1. Lefort, Vincent, Richard Desper, and Olivier Gascuel. FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference program. Molecular biology and evolution 32.10 (2015): 2798-2800.
2. Criscuolo, Alexis, and Olivier Gascuel. Fast NJ-like algorithms to deal with incomplete distance matrices. BMC bioinformatics 9.1 (2008): 166.
3. Vachaspati, Pranjal, and Tandy Warnow. ASTRID: accurate species trees from internode distances. BMC genomics 16.Suppl 10 (2015): S3.
4. Mallo, Diego, Leonardo De Oliveira Martins, and David Posada. SimPhy: Phylogenomic Simulation of Gene, Locus, and Species Trees. Systematic biology 65.2 (2016): 334-344.
5. Jarvis, Erich D., et al. Phylogenomic analyses data of the avian phylogenomics project. GigaScience 4.1 (2015): 1-9.

DISTIQUE by Erfan Sayyari
