TY - JOUR
T1 - How complete are “complete” genome assemblies?—an avian perspective
AU - Peona, Valentina
AU - Weissensteiner, Matthias H.
AU - Suh, Alexander
N1 - Funding Information:
We thank Mozes Blom, Anne‐Marie Dion‐Côté, Jan Engler, Per Ericson, Takeshi Kawakami, Cormac Kinsella and Robyn Womack for valuable comments on an earlier version of the manuscript. We also thank Shawn Narum and four anonymous reviewers for further improving this manuscript with their comments. A.S. was supported by grants from the Swedish Science Foundation (2016‐05139) and the SciLifeLab Swedish Biodiversity Program (2015‐R14).
Publisher Copyright:
© 2018 The Authors. Molecular Ecology Resources Published by John Wiley & Sons Ltd.
PY - 2018/11
Y1 - 2018/11
N2 - The genomics revolution has led to the sequencing of a large variety of nonmodel organisms often referred to as “whole” or “complete” genome assemblies. But how complete are these, really? Here, we use birds as an example for nonmodel vertebrates and find that, although suitable in principle for genomic studies, the current standard of short-read assemblies misses a significant proportion of the expected genome size (7% to 42%; mean 20 ± 9%). In particular, regions with strongly deviating nucleotide composition (e.g., guanine-cytosine-[GC]-rich) and regions highly enriched in repetitive DNA (e.g., transposable elements and satellite DNA) are usually underrepresented in assemblies. However, long-read sequencing technologies successfully characterize many of these underrepresented GC-rich or repeat-rich regions in several bird genomes. For instance, only ~2% of the expected total base pairs are missing in the last chicken reference (galGal5). These assemblies still contain thousands of gaps (i.e., fragmented sequences) because some chromosomal structures (e.g., centromeres) likely contain arrays of repetitive DNA that are too long to bridge with currently available technologies. We discuss how to minimize the number of assembly gaps by combining the latest available technologies with complementary strengths. At last, we emphasize the importance of knowing the location, size and potential content of assembly gaps when making population genetic inferences about adjacent genomic regions.
AB - The genomics revolution has led to the sequencing of a large variety of nonmodel organisms often referred to as “whole” or “complete” genome assemblies. But how complete are these, really? Here, we use birds as an example for nonmodel vertebrates and find that, although suitable in principle for genomic studies, the current standard of short-read assemblies misses a significant proportion of the expected genome size (7% to 42%; mean 20 ± 9%). In particular, regions with strongly deviating nucleotide composition (e.g., guanine-cytosine-[GC]-rich) and regions highly enriched in repetitive DNA (e.g., transposable elements and satellite DNA) are usually underrepresented in assemblies. However, long-read sequencing technologies successfully characterize many of these underrepresented GC-rich or repeat-rich regions in several bird genomes. For instance, only ~2% of the expected total base pairs are missing in the last chicken reference (galGal5). These assemblies still contain thousands of gaps (i.e., fragmented sequences) because some chromosomal structures (e.g., centromeres) likely contain arrays of repetitive DNA that are too long to bridge with currently available technologies. We discuss how to minimize the number of assembly gaps by combining the latest available technologies with complementary strengths. At last, we emphasize the importance of knowing the location, size and potential content of assembly gaps when making population genetic inferences about adjacent genomic regions.
KW - Birds
KW - Genomics
KW - Hybrid assembly
KW - Long reads
KW - Multiplatform sequencing
KW - Repeats
UR - http://www.scopus.com/inward/record.url?scp=85053660675&partnerID=8YFLogxK
U2 - 10.1111/1755-0998.12933
DO - 10.1111/1755-0998.12933
M3 - Comment/debate
C2 - 30035372
AN - SCOPUS:85053660675
SN - 1755-098X
VL - 18
SP - 1188
EP - 1195
JO - Molecular Ecology Resources
JF - Molecular Ecology Resources
IS - 6
ER -