Fragmentary gene sequences negatively impact gene tree and species tree reconstruction

Abstract

Species tree reconstruction from genome-wide data is increasingly being attempted, in most cases using a two-step the approach of first estimating individual gene trees and then summarizing them to obtain a species tree. The accuracy of this approach, which promises to account for gene tree discordance, depends on the quality of the inferred gene trees. At the same time, phylogenomic and phylotranscriptomic analyses typically use involved bioinformatics pipelines for data preparation. Errors and shortcomings resulting from these preprocessing steps may impact the species tree analyses at the other end of the pipeline. In this article, we first show that the presence of fragmentary data for some species in a gene alignment, as often seen on real data, can result in substantial deterioration of gene trees, and as a result, the species tree. We then investigate a simple filtering strategy where individual fragmentary sequences are removed from individual genes but the rest of the gene is retained. Both in simulations and by reanalyzing a large insect phylotranscriptomic data set, we show the effectiveness of this simple filtering strategy.

Experiments and Codes

Please note that all the code used is available on this github repository. The code uses python2 or bash, and dependencies are dendropy4, nwcik utility and PASTA.

In this section, we will describe the commands we used for generating and performing the simulations as well as the biological analyses.

Filtering alignments

We used mask-for-gt.sh (internally uses PASTA) to remove alignments with lots of gap characters.

Generate simulated dataset

We used draw_parameters.py (please check the comments inside the code for more information) and the following files to generate the fragmentary stats for each species in each gene.

average_branch_lengths.csv: average tip-to-root distances for biological Insects gene trees
fragmentary_augmented.csv: frequency of gap characters for the biological Insects dataset
simulated_average_branch_lengths.csv: averge tip-to-root distances for biological Insects gene trees per each replicate.

To generate the sequence files we used generateFragmentary.py passing the output of draw_parameters.py.

Inferring gene trees

We used the following commands to infer the gene trees:

bash runraxml-bestML.sh [ALIGNMENT NAME WITHOUT PHYLIP OR FASTA] [DT] [GENE IDs] [labels] [GENE DIR] [NUM CPU]

This command assumes a specific structure. It assumes that the fasta file is under this directory:

[GENE DIR]/[GENE IDs]/[DT]-[ALIGNMENT NAME WITHOUT PHYLIP OR FASTA]/[GENE IDs]-[ALIGNMENT NAME WITHOUT PHYLIP OR FASTA].fasta

It first converts the fasta file to phylip (using convert_to_phylip.sh), and then finds the list of identical alignments in the phylip file (using listRemovedTaxa.py). Then RAxML (raxmlHPC) will run to remove identical alignments. For amino acid alignements (DT should be FAA), we find the best model of evolution using ProteinModelSelection.pl. Then raxmlHPC-PTHREADS or raxmlHPC (based on the number of cpus) will be used to infer the gene trees (please change -N 20 or -N 2 at lines 89 or 91 for your usage. This number defines the number of repeats for RAxML). Finally, it uses addIdenticalTaxa.py to add back the identical species to the gene trees.
For inferring the gene trees (as well as bootstrapping) with FastTree, we used this command runraxml-fasttreeboot.sh. It internally uses fasttree (double-precision version) and same scripts listed above.
For doing bootstrapping we used this command runraxml-fasttree-start-boostrapping.sh, which uses FastTree (double-precision version) internally, and the same scripts listed above.

RAxML

raxmlHPC -m <model> -n best -s <input_phylip> -p <seed_num> -N 10

   raxmlHPC-PTHREADS -T <CPU> -m <model> -n best -s <input_phylip> -p <seed_num> -N 10

where $s is the seed number, and for NA, model is GTRGAMMA, and for amino acids we let RAxML to find the best model automatically using the command:

perl ProteinModelSelection.pl <input_phylip> > ../bestModel.<alignname>

   model=PROTGAMMA`sed -e "s/.* //g" bestModel.<alignname>`

Please note that for the simulated dataset we used RAxML with 2 starting trees, while for biological analysis, we used 10 starting trees.

FastTree

For nucleotide alignment, we use this command:

fasttree -gtr -gamma -nt <input_phylip> > fasttree.tre.best 2> ft.log.best

and for protein alignment, we use the default substitution model (JTT+CAT):

fasttree <input_phylip> > fasttree.tre.best 2> ft.log.best

ASTRAL

java -jar astral.4.11.1.jar -i <genetrees> -o <species_tree> > logfile 2>&1

Filtering sequences

We first filter sites that have less than site_threshold non-gap characters, and then filter the sequences that have less than sequence_threshold non-gap sequences.

python run_seqtools.py  -infile <input_sequence> -masksites <site_threshold> -outfile <site_filtered_sequence> > <logfile> 2>&1

   python run_seqtools.py -infile <site_filtered_sequence> -filterfragments <sequence_threshold> -outfile <filtered_sequence> > <logfile> 2>&1

Miscellaneous

To remove third codon positions from alignments we used remove_3rd_codon_nt_fas.sh.
To convert fasta file to phylip we used convert_to_phylip.sh.
To generate fragmentary statistics we used generate_fragmentary_stat.sh.
To generate GC content statistics we used gc-stats.py.
To calculate bootstrap supports we used draw_support_on_best_ML.sh.
We used ASTRAL version 4.11.1 to infer species trees
To root biological dataset we used root-nw_friendly.py.

Data

We used simulated as well as biological data to study the effects of fragmentary data on the quality of gene trees and species trees.

Simulations

In simulations, we study the impacts of fragmentary data and the filtering strategy on the accuracy of gene trees and, eventually, species trees’ accuracy. The simulated dataset is available on Drayad. The simulated dataset has 50 replicates (in 50 separate folders), with 101 species (one outgroup). For each replicate, we provide original gene sequences (all-genes.phylip), and the fragments inserted version (all-genes-fragAdded.phylip). We also provide the FastTree (using double precision), e.g. in FNA-genes.fragAdded-mask1sites.mask1taxa, and RAxML, e.g. in FNA-genes.fragAdded-mask1sites.mask1taxa-raxml, gene trees (using 50, 200, and 1000 genes) and the corresponding estimated species trees. Folders with raxml at the end of their name contain RAxML gene trees. To indicate the filtering strategy we use this format: mask[x%]sites.mask[y%]taxa, which indicates that we removed all sites with less than x% non-gap characters, and we removed all species that have less than y% non-gap characters.

Biological

We studied an empirical transcriptomic data set of insects consisting of 1,478 protein-coding genes of 144 taxa, where 27% of the alignment is gaps (Misof et al. 2014). We now share the biological sequences, gene trees, and species trees:

RAxML log files are available on GitHub
Gene trees, species trees, and sequences are available on Dryad

References

A. Stamatakis: “RAxML Version 8: A tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies”. In Bioinformatics, 2014
Junier, T. and Zdobnov, E.M., 2010. The Newick utilities: high-throughput phylogenetic tree processing in the UNIX shell. Bioinformatics, 26(13), pp.1669-1670.
Mirarab, S., Nguyen, N., Guo, S., Wang, L.S., Kim, J. and Warnow, T., 2015. PASTA: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. Journal of Computational Biology, 22(5), pp.377-386.
Misof B, Liu S, Meusemann K, Peters RS, Donath A, Mayer C, Frandsen PB, Ware J, Flouri T, Beutel RG, Niehuis O, Petersen M. 2014. Phylogenomics resolves the timing and pattern of insect evolution. Science 346(6210): 763–767.
Price, M.N., Dehal, P.S. and Arkin, A.P., 2010. FastTree 2–approximately maximum-likelihood trees for large alignments. PloS one, 5(3), p.e9490.