Abstract
Metagenomes encode an enormous diversity of proteins, reflecting a multiplicity of functions and activities [1,2]. Exploration of this vast sequence space has been limited to a comparative analysis against reference microbial genomes and protein families derived from those genomes. Here, to examine the scale of yet untapped functional diversity beyond what is currently possible through the lens of reference genomes, we develop a computational approach to generate reference-free protein families from the sequence space in metagenomes. We analyse 26,931 metagenomes and identify 1.17 billion protein sequences longer than 35 amino acids with no similarity to any sequences from 102,491 reference genomes or the Pfam database [3]. Using massively parallel graph-based clustering, we group these proteins into 106,198 novel sequence clusters with more than 100 members, doubling the number of protein families obtained from the reference genomes clustered using the same approach. We annotate these families on the basis of their taxonomic, habitat, geographical and gene neighbourhood distributions and, where sufficient sequence diversity is available, predict protein three-dimensional models, revealing novel structures. Overall, our results uncover an enormously diverse functional space, highlighting the importance of further exploring the microbial functional dark matter.
Field | Value |
---|---|
Original language | English |
Pages (from-to) | 594-602 |
Number of pages | 9 |
Journal | Nature |
Volume | 622 |
Issue number | 7983 |
Early online date | 11 Oct 2023 |
DOIs | https://doi.org/10.1038/s41586-023-06583-7 |
Publication status | Published - 19 Oct 2023 |
Cite this
Unraveling the functional dark matter through global metagenomics. / Pavlopoulos, Georgios A.; Baltoumas, Fotis A.; Liu, Sirui et al.
In: Nature, Vol. 622, No. 7983, 19.10.2023, p. 594-602.
Research output: Contribution to journal › Article › peer-review
TY - JOUR
T1 - Unraveling the functional dark matter through global metagenomics
AU - Pavlopoulos, Georgios A.
AU - Baltoumas, Fotis A.
AU - Liu, Sirui
AU - Selvitopi, Oguz
AU - Camargo, Antonio Pedro
AU - Nayfach, Stephen
AU - Azad, Ariful
AU - Roux, Simon
AU - Call, Lee
AU - Ivanova, Natalia N.
AU - Chen, I. Min
AU - Paez-Espino, David
AU - Karatzas, Evangelos
AU - Acinas, Silvia G.
AU - Ahlgren, Nathan
AU - Attwood, Graeme
AU - Baldrian, Petr
AU - Berry, Timothy
AU - Bhatnagar, Jennifer M.
AU - Bhaya, Devaki
AU - Bidle, Kay D.
AU - Blanchard, Jeffrey L.
AU - Boyd, Eric S.
AU - Bowen, Jennifer L.
AU - Bowman, Jeff
AU - Brawley, Susan H.
AU - Brodie, Eoin L.
AU - Brune, Andreas
AU - Bryant, Donald A.
AU - Buchan, Alison
AU - Cadillo-Quiroz, Hinsby
AU - Campbell, Barbara J.
AU - Cavicchioli, Ricardo
AU - Chuckran, Peter F.
AU - Coleman, Maureen
AU - Crowe, Sean
AU - Colman, Daniel R.
AU - Currie, Cameron R.
AU - Dangl, Jeff
AU - Delherbe, Nathalie
AU - Denef, Vincent J.
AU - Dijkstra, Paul
AU - Distel, Daniel D.
AU - Eloe-Fadrosh, Emiley
AU - Fisher, Kirsten
AU - Francis, Christopher
AU - Garoutte, Aaron
AU - Gaudin, Amelie
AU - Gerwick, Lena
AU - Mock, Thomas
AU - Novel Metagenome Protein Families Consortium
N1 - Funding Information: We thank H. Maughan for reading the paper; and all of the colleagues who contributed to the many facets of metagenomics, from sample collection to sequencing and annotation that made this work possible. The list of the JGI Proposal Award DOIs is available in Supplementary Table 13. This work used resources of the National Energy Research Scientific Computing Center (NERSC), supported by the Office of Science of the US Department of Energy (DOE). Additional computations were performed with the use of the Greek Research and Technology Network (GRNET) Aris High Processing Computing (HPC) infrastructure (project code: PR009008-BOLOGNA). This work was supported in part by the US DOE Joint Genome Institute (DE-AC02–05CH11231, in part), a DOE Office of Science User Facility; the Applied Mathematics program of the DOE Office of Advanced Scientific Computing Research (DE-AC02–05CH11231, in part), Office of Science of the US DOE; Exascale Computing Project (17-SC-20-SC), a collaborative effort of the US DOE Office of Science and the National Nuclear Security Administration; DOE grant DE-SC0022098. G.A.P., F.A.B. and E.K. were supported by Fondation Santé and the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the ‘First Call for H.F.R.I. Research Projects to support faculty members and researchers and the procurement of high-cost research equipment grant’ (grant ID HFRI-17-1855-BOLOGNA). G.A.P. also acknowledges the Marie Skłodowska-Curie Individual Fellowships (MSCA-IF-EF-CAR, grant ID 838018, H2020-MSCA-IF-2018) and ‘The Greek Research Infrastructure for Personalized Medicine (pMedGR)’ (MIS 5002802), which is implemented under the Action ‘Reinforcement of the Research and Innovation Infrastructure’, funded by the Operational Program ‘Competitiveness, Entrepreneurship and Innovation’ (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund). C.A.O. and I.I. acknowledge support by the project Elixir-GR (MIS 5002780), implemented under the Action ‘Reinforcement of the Research and Innovation Infrastructure’, funded by the Operational Program Competitiveness, Entrepreneurship and Innovation (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund). S.O. and S.L. are supported by NIH grant DP5OD026389 and the Moore–Simons Project on the Origin of the Eukaryotic Cell, Simons Foundation 735929LPI (https://doi.org/10.46714/735929LPI). J.P.-R. was supported by the US DOE Genomic Sciences Program, award SCW1632; and work conducted at the LLNL was conducted under the auspices of the US DOE under Contract DE-AC52-07NA27344. Work from the consortium was supported by NSF grants OIA-1826734, DEB-1441717 and OCE-1232982; NSF 1921429; CONACyT grants A1-S-9889 and CB-2010-01-151007; US DOE, Office of Science, Office of Biological and Environmental Research (BER), Great Lakes Bioenergy Research Center (DOE BER DE-SC0018409 and DE-FC02-07ER64494); NSF grant OCE-082546, US DOE, Office of Science, Facilities Integrating Collaborations for User Science (FICUS) program, Office of Workforce Development for Teachers and Scientists, Office of Science Graduate Student Research (SCGSR) program; New Zealand Foundation for Research, Science and Technology grant CO1X0306 and National Science Foundation grant 1745341; NSF Division of Chemical, Bioengineering, Environmental and Transport Systems grants 1438092 and 1643486; NSF OCE-1559179, NSF OCE-1537951, NSF OCE-1459200, Gordon & Betty Moore Foundation Investigator Award 3789; the G. Unger Vetlesen and Ambrose Monell Foundations; the Natural Sciences and Engineering Research Council of Canada; Genome Canada and Genome British Columbia; the PR-INBRE BiRC program (NIH/NIGMS- award number P20 GM103475); Great Lakes Bioenergy Research Center, US DOE, Office of Science, Office of Biological and Environmental Research under award numbers DE-SC0018409 and DE-FC02-07ER64494; the Agriculture and Food Research Initiative, competitive grant 2009-447 35319-05186 from the US Department of Agriculture, National Institute of Food and Agriculture; Sol Leshin Foundation and the Shanbrom Family Fund; Towards Sustainability Foundation, Cornell Sigma Xi, NSERC PGS-D, NSF-BREAD (IOS-0965336), Cornell Biogeochemistry Program, Cornell Crop and Soil Science Department, USDA-NIFA Carbon Cycle (2014-6700322069) and the Cornell Atkinson Center for a Sustainable Future; Office of Science (BER), US DOE (DE-SC0014395); NSF grant OCE 0424602; US DOE, Office of Science, Office of Biological and Environmental Research, Environmental System Science (ESS) Program; Australian Research Council: DP150100244; NSF-OPP 1641019; NSF 1754756; NSF 1442231; NSF award OCE-173723; USDA National Institute of Food and Agriculture Foundational Program (award 2017-67019-26396); USDA NIFA award 2011-67019-30178; BER grant DE-SC0014395; National Science Foundation grant DEB-1927155; US DOE, Office of Science, Office of Biological and Environmental Research, Environmental System Science (ESS) Program; River Corridor Scientific Focus Area (SFA) project at Pacific Northwest National Laboratory (PNNL); grant NNX16AJ62G from NASA Exobiology; NASA Exobiology awards 80NSSC19K1633 and NNX17AK85G; NSF award DEB-1146149; US NSF (DEB 1912525); US DOE Office of Biological and Environmental Research (DE-SC0020382); NSF EAR-1820658; DE-FG02-94ER20137 from the Photosynthetic Systems Program, Division of Chemical Sciences, Geosciences and Biosciences (CSGB), Office of Basic Energy Sciences of the US DOE; Max Planck Society and the BioEnergy Science Center (BESC), a US DOE Bioenergy Research Center supported by the Office of Biological and Environmental Research in the DOE Office of Science; and US DOE, Office of Science, Biological and Environmental Research, as part of the Plant Microbe Interfaces Scientific Focus Area at Oak Ridge National Laboratory.
PY - 2023/10/19
Y1 - 2023/10/19
AB - Metagenomes encode an enormous diversity of proteins, reflecting a multiplicity of functions and activities [1,2]. Exploration of this vast sequence space has been limited to a comparative analysis against reference microbial genomes and protein families derived from those genomes. Here, to examine the scale of yet untapped functional diversity beyond what is currently possible through the lens of reference genomes, we develop a computational approach to generate reference-free protein families from the sequence space in metagenomes. We analyse 26,931 metagenomes and identify 1.17 billion protein sequences longer than 35 amino acids with no similarity to any sequences from 102,491 reference genomes or the Pfam database [3]. Using massively parallel graph-based clustering, we group these proteins into 106,198 novel sequence clusters with more than 100 members, doubling the number of protein families obtained from the reference genomes clustered using the same approach. We annotate these families on the basis of their taxonomic, habitat, geographical and gene neighbourhood distributions and, where sufficient sequence diversity is available, predict protein three-dimensional models, revealing novel structures. Overall, our results uncover an enormously diverse functional space, highlighting the importance of further exploring the microbial functional dark matter.
UR - http://www.scopus.com/inward/record.url?scp=85173864812&partnerID=8YFLogxK
U2 - 10.1038/s41586-023-06583-7
DO - 10.1038/s41586-023-06583-7
M3 - Article
C2 - 37821698
AN - SCOPUS:85173864812
VL - 622
SP - 594
EP - 602
JO - Nature
JF - Nature
SN - 0028-0836
IS - 7983
ER -