Tool for finding linked genetic polymorphisms in reference-less complex plant genomes from unassembled next-generation reads

Project Details


Differences between the genomes of individuals of the same species, called polymorphisms, are the genetic basis of traits such as resistance or susceptibility to disease. By identifying polymorphisms it is possible either to pinpoint the agents of resistance or susceptibility or, at the least, to locate nearby regions of the genome that can act as positional markers associated with the trait of interest.
Many wild populations of plant species that are closely related to domesticated varieties important for food and industry are resistant to common diseases that could devastate crops across the world. Combating these diseases chemically is both costly and environmentally damaging, so breeding resistant varieties is necessary for food security.
Genetic methods for identifying markers are time-consuming and require large amounts of expensive and slow laboratory work. New methods in high-throughput DNA sequencing, known as next-generation sequencing (NGS), are able to comprehensively sample entire genomes at an affordable cost. These technologies return many millions of small fragments, not a continuous sequence. The volumes of data generated by NGS instruments have resulted in the need for new methods to assemble the fragments or to align them to an existing, previously assembled reference sequence. Currently, polymorphism identification relies on having some sort of reference to which sequence reads can be aligned. Aligned reads are then examined for consensus differences from the reference, which indicate a genetic difference between the sampled genome and the reference. Naturally this is only possible where a reference genome exists. Since the creation of even a rough draft genome sequence can take many months, detecting polymorphisms that specify disease resistance in relatives of agriculturally important organisms that have no such reference becomes a massively time-consuming and difficult task. Even when a reference sequence is available, identifying polymorphisms among many individuals from a population, for example to associate genotypes with specific phenotypes, requires many cycles of alignment and comparison.
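The reference-based approach described above can be illustrated with a minimal sketch. It assumes ungapped alignments given as (start, sequence) pairs, and the function name and the depth and agreement thresholds are hypothetical choices for illustration, not part of any particular pipeline.

```python
from collections import Counter, defaultdict

def call_consensus_differences(reference, aligned_reads, min_depth=3, min_fraction=0.8):
    """Return {position: (reference_base, consensus_base)} for sites where
    the read consensus disagrees with the reference.

    aligned_reads is a list of (start, sequence) pairs: ungapped alignments
    with 0-based start positions (a simplifying assumption).
    """
    # Pile up the bases observed at each reference position.
    pileup = defaultdict(Counter)
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    differences = {}
    for pos, counts in sorted(pileup.items()):
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        # Report a site only if it is well covered, the reads agree with
        # each other, and their consensus differs from the reference.
        if depth >= min_depth and support / depth >= min_fraction and base != reference[pos]:
            differences[pos] = (reference[pos], base)
    return differences

# Toy data: every read carries a T where the reference has an A at position 4.
reference = "ACGTACGTAC"
aligned_reads = [(0, "ACGTTCGT"), (2, "GTTCGTAC"), (1, "CGTTCG"), (3, "TTCGTAC")]
differences = call_consensus_differences(reference, aligned_reads)
print(differences)
```

Every step here depends on the reference string, which is why the approach cannot be applied to a species that has none.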
Our objective is to develop a tool that takes advantage of recent developments in high-throughput DNA sequencing and new computational methods to identify polymorphisms between multiple sources without the need for comparison with a reference sequence. These methods will allow us to detect genetic variants directly from the raw sequence reads. The time required would be on the order of hours, rather than the months or years needed where assembly is required. The tool will produce short but useful genomic mini-assemblies with embedded polymorphisms that can be utilised by bench workers for downstream experiments. We will be able to rank and classify single-nucleotide polymorphisms (SNPs) based on the provenance of different reads, for example by detecting SNPs common to individuals with a trait. The tool will be an important addition to the repertoire of methods available to bioinformaticians involved in polymorphism detection and could be invaluable to projects without an available reference sequence. The tool will also prove useful to bioinformaticians who do have a reference sequence: it will remove the need for many sequential alignments to a reference and compress subsequent polymorphism detection into a single step.
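One way reference-free detection could work is sketched below. The project description does not name its algorithm, so this k-mer comparison is an assumed stand-in: it looks for pairs of k-mers, one specific to each sample, that share both flanks and differ only at the central base. The function names, the value of k and the count threshold are all hypothetical, and reverse-complement strands are ignored for brevity.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def candidate_snps(reads_a, reads_b, k=7, min_count=2):
    """Return (left_flank, base_in_a, base_in_b, right_flank) tuples.

    Assumes k is odd so that each k-mer has a single central base.
    """
    mid = k // 2
    # Keep only k-mers seen at least min_count times, to damp sequencing errors.
    kmers_a = {km for km, n in count_kmers(reads_a, k).items() if n >= min_count}
    kmers_b = {km for km, n in count_kmers(reads_b, k).items() if n >= min_count}
    only_a = kmers_a - kmers_b
    only_b = kmers_b - kmers_a

    # Index the sample-B-specific k-mers by their flanking sequences.
    by_flanks = {}
    for km in only_b:
        by_flanks.setdefault((km[:mid], km[mid + 1:]), []).append(km[mid])

    snps = []
    for km in sorted(only_a):
        left, right = km[:mid], km[mid + 1:]
        for base_b in sorted(by_flanks.get((left, right), [])):
            snps.append((left, km[mid], base_b, right))
    return snps

# Toy reads from two samples that differ at a single base (A versus T).
reads_a = ["GATTACAGGC", "TACAGGCTTA", "GATTACAGGCTTA"]
reads_b = ["GATTACTGGC", "TACTGGCTTA", "GATTACTGGCTTA"]
snps = candidate_snps(reads_a, reads_b)
for left, base_a, base_b, right in snps:
    print(f"{left}[{base_a}/{base_b}]{right}")
```

Each reported tuple is a very small stand-in for a mini-assembly with an embedded polymorphism, and because the k-mer sets are kept per sample, the provenance of each allele is known. No reference sequence is consulted at any point.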
Effective start/end date: 31/10/11 to 30/04/13


  • Biotechnology and Biological Sciences Research Council: £112,829.00