Fragmentary gene sequences negatively impact gene tree and species tree reconstruction

Abstract

Species tree reconstruction from genome-wide data is increasingly being attempted, in most cases using a two-step the approach of first estimating individual gene trees and then summarizing them to obtain a species tree. The accuracy of this approach, which promises to account for gene tree discordance, depends on the quality of the inferred gene trees. At the same time, phylogenomic and phylotranscriptomic analyses typically use involved bioinformatics pipelines for data preparation. Errors and shortcomings resulting from these preprocessing steps may impact the species tree analyses at the other end of the pipeline. In this article, we first show that the presence of fragmentary data for some species in a gene alignment, as often seen on real data, can result in substantial deterioration of gene trees, and as a result, the species tree. We then investigate a simple filtering strategy where individual fragmentary sequences are removed from individual genes but the rest of the gene is retained. Both in simulations and by reanalyzing a large insect phylotranscriptomic data set, we show the effectiveness of this simple filtering strategy.

Experiments and Codes

Please note that all the code used is available on this github repository. The code uses python2 or bash, and dependencies are dendropy4, nwcik utility and PASTA.

In this section, we will describe the commands we used for generating and performing the simulations as well as the biological analyses.

Filtering alignments

We used mask-for-gt.sh (internally uses PASTA) to remove alignments with lots of gap characters.

Generate simulated dataset

We used draw_parameters.py (please check the comments inside the code for more information) and the following files to generate the fragmentary stats for each species in each gene.

To generate the sequence files we used generateFragmentary.py passing the output of draw_parameters.py.

Inferring gene trees

We used the following commands to infer the gene trees:

bash runraxml-bestML.sh [ALIGNMENT NAME WITHOUT PHYLIP OR FASTA] [DT] [GENE IDs] [labels] [GENE DIR] [NUM CPU]

This command assumes a specific structure. It assumes that the fasta file is under this directory:

[GENE DIR]/[GENE IDs]/[DT]-[ALIGNMENT NAME WITHOUT PHYLIP OR FASTA]/[GENE IDs]-[ALIGNMENT NAME WITHOUT PHYLIP OR FASTA].fasta

RAxML

raxmlHPC -m <model> -n best -s <input_phylip> -p <seed_num> -N 10

   raxmlHPC-PTHREADS -T <CPU> -m <model> -n best -s <input_phylip> -p <seed_num> -N 10

where $s is the seed number, and for NA, model is GTRGAMMA, and for amino acids we let RAxML to find the best model automatically using the command:

perl ProteinModelSelection.pl <input_phylip> > ../bestModel.<alignname>

   model=PROTGAMMA`sed -e "s/.* //g" bestModel.<alignname>`
   

Please note that for the simulated dataset we used RAxML with 2 starting trees, while for biological analysis, we used 10 starting trees.

FastTree

For nucleotide alignment, we use this command:

fasttree -gtr -gamma -nt <input_phylip> > fasttree.tre.best 2> ft.log.best

and for protein alignment, we use the default substitution model (JTT+CAT):

fasttree <input_phylip> > fasttree.tre.best 2> ft.log.best

ASTRAL

java -jar astral.4.11.1.jar -i <genetrees> -o <species_tree> > logfile 2>&1
   

Filtering sequences

We first filter sites that have less than site_threshold non-gap characters, and then filter the sequences that have less than sequence_threshold non-gap sequences.

python run_seqtools.py  -infile <input_sequence> -masksites <site_threshold> -outfile <site_filtered_sequence> > <logfile> 2>&1

   python run_seqtools.py -infile <site_filtered_sequence> -filterfragments <sequence_threshold> -outfile <filtered_sequence> > <logfile> 2>&1

Miscellaneous

Data

We used simulated as well as biological data to study the effects of fragmentary data on the quality of gene trees and species trees.

Simulations

In simulations, we study the impacts of fragmentary data and the filtering strategy on the accuracy of gene trees and, eventually, species trees’ accuracy. The simulated dataset is available on Drayad. The simulated dataset has 50 replicates (in 50 separate folders), with 101 species (one outgroup). For each replicate, we provide original gene sequences (all-genes.phylip), and the fragments inserted version (all-genes-fragAdded.phylip). We also provide the FastTree (using double precision), e.g. in FNA-genes.fragAdded-mask1sites.mask1taxa, and RAxML, e.g. in FNA-genes.fragAdded-mask1sites.mask1taxa-raxml, gene trees (using 50, 200, and 1000 genes) and the corresponding estimated species trees. Folders with raxml at the end of their name contain RAxML gene trees. To indicate the filtering strategy we use this format: mask[x%]sites.mask[y%]taxa, which indicates that we removed all sites with less than x% non-gap characters, and we removed all species that have less than y% non-gap characters.

Biological

We studied an empirical transcriptomic data set of insects consisting of 1,478 protein-coding genes of 144 taxa, where 27% of the alignment is gaps (Misof et al. 2014). We now share the biological sequences, gene trees, and species trees:

References