Punchline: Identifying and comparing significant Pfam protein domain differences across draft whole genome sequences

Research output: Working paper › Preprint

Abstract

Motivation: Draft assemblies built from short paired-end Illumina reads can be fragmented into many contigs, and are affected by repeat regions arising from mobile element activity within the genome or from inherently repetitive gene structure. Annotating such assemblies for function and analysing gene content can be challenging when predicted genes are split across contigs. This often occurs within specific families of genes, such as longer genes with repeating domains, genes encoding several transmembrane domains, and genes of unusual nucleotide content. These genes are often virulence determinants, so losing them can seriously affect downstream studies.

Results: Rather than studying the predicted gene content of draft genomes, we examined predicted protein content using the Pfam domain complements of predicted proteins. We produced a workflow, Punchline, to study the genetic content of draft contig assemblies by looking at the complement of short domains, which are unlikely to be affected by assembly fragmentation. We investigated a dataset of Bacteroides ovatus genomes grouped by the vertebrate host from which each organism was isolated, and identified potential host-restricted functions and host-restricted phylogenetic clustering.
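
The idea of comparing domain complements between host groups can be illustrated with a minimal sketch. It is not the Punchline implementation: the abstract does not specify the statistical test, so Fisher's exact test on domain presence/absence is used here as an assumption. The function names, genome names, host labels and Pfam accessions below are all hypothetical toy values, not results from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that x genomes in group 1 carry the domain
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def compare_domain_presence(domains_by_genome, group_by_genome, group_a, group_b):
    """Return {pfam_id: (n_a_with, n_b_with, p_value)} for each domain seen."""
    genomes_a = [g for g, grp in group_by_genome.items() if grp == group_a]
    genomes_b = [g for g, grp in group_by_genome.items() if grp == group_b]
    all_domains = set().union(*domains_by_genome.values())
    results = {}
    for dom in sorted(all_domains):
        a = sum(dom in domains_by_genome[g] for g in genomes_a)
        c = sum(dom in domains_by_genome[g] for g in genomes_b)
        b, d = len(genomes_a) - a, len(genomes_b) - c
        results[dom] = (a, c, fisher_exact_two_sided(a, b, c, d))
    return results


# Toy input (hypothetical): the set of Pfam accessions found in each draft genome,
# and the host each isolate came from. Accessions are used purely as labels.
domains_by_genome = {
    "h1": {"PF00001", "PF00002"}, "h2": {"PF00001", "PF00002"}, "h3": {"PF00001", "PF00002"},
    "p1": {"PF00001"}, "p2": {"PF00001"}, "p3": {"PF00001"},
}
group_by_genome = {"h1": "human", "h2": "human", "h3": "human",
                   "p1": "pig", "p2": "pig", "p3": "pig"}

results = compare_domain_presence(domains_by_genome, group_by_genome, "human", "pig")
for dom, (n_a, n_b, p) in results.items():
    print(dom, n_a, n_b, round(p, 3))
```

In a real analysis the per-genome domain sets would come from a Pfam search of the predicted proteins, and testing many domains at once would call for a multiple-testing correction, which this sketch omits.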
Original language: English
Publisher: bioRxiv
Publication status: Published - 11 Jul 2019
