TY - JOUR
T1 - Reconstructing (super)trees from data sets with missing distances: Not all is lost
AU - Kettleborough, George
AU - Dicks, Jo L.
AU - Roberts, Ian N.
AU - Huber, Katharina T.
N1 - The online version of this article has been published in Oxford Open, which is an open access initiative (more information available here [www.oxfordjournals.org]). Readers are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the journal and Oxford University Press are attributed as the original place of publication with the correct citation details given. If an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work, this must be clearly indicated. For those wishing to make commercial use of the article, please go to [email protected] for permissions information or see the website [www.oxfordjournals.org].
PY - 2015
Y1 - 2015
N2 - The wealth of phylogenetic information accumulated over many decades of biological research, coupled with recent technological advances in molecular sequence generation, present significant opportunities for researchers to investigate relationships across and within the kingdoms of life. However, to make best use of this data wealth, several problems must first be overcome. One key problem is finding effective strategies to deal with missing data. Here, we introduce Lasso, a novel heuristic approach for reconstructing rooted phylogenetic trees from distance matrices with missing values, for datasets where a molecular clock may be assumed. Contrary to other phylogenetic methods on partial datasets, Lasso possesses desirable properties such as its reconstructed trees being both unique and edge-weighted. These properties are achieved by Lasso restricting its leaf set to a large subset of all possible taxa, which in many practical situations is the entire taxa set. Furthermore, the Lasso approach is distance-based, rendering it very fast to run and suitable for datasets of all sizes, including large datasets such as those generated by modern Next Generation Sequencing technologies. To better understand the performance of Lasso, we assessed it by means of artificial and real biological datasets, showing its effectiveness in the presence of missing data. Furthermore, by formulating the supermatrix problem as a particular case of the missing data problem, we assessed Lasso's ability to reconstruct supertrees. We demonstrate that, although not specifically designed for such a purpose, Lasso performs better than or comparably with five leading supertree algorithms on a challenging biological data set. Finally, we make freely available a software implementation of Lasso so that researchers may, for the first time, perform both rooted tree and supertree reconstruction with branch lengths on their own partial datasets.
KW - Phylogenetic trees
KW - rooted trees
KW - partial distance
KW - supertree
KW - lasso
KW - molecular clock
KW - dendrogram
U2 - 10.1093/molbev/msv027
DO - 10.1093/molbev/msv027
M3 - Article
VL - 32
SP - 1628
EP - 1642
JO - Molecular Biology and Evolution
JF - Molecular Biology and Evolution
SN - 0737-4038
IS - 6
ER -