SPbU Scientists: New Algorithm Makes the Advanced Technology of Genome Sequencing More Efficient
Bioinformaticians at St. Petersburg State University have created a new algorithm, TruSPAdes, which greatly improves the efficiency of the TruSeq Synthetic Long Reads (TSLR) sequencing technology. The new development provides longer and more accurate genome fragments for assembly. The findings are published in Nature Methods, a prestigious scientific journal. The authors of the article are Anton Bankevich, a Junior Research Fellow at St. Petersburg State University, and Pavel Pevzner, the Head of the Centre for Algorithmic Biotechnology of the Institute for Translational Biomedicine at St. Petersburg State University.
Genome assembly (the reconstruction of the nucleotide sequence) is one of the central tasks of bioinformatics. This process consists of two phases: sequencing ("cutting" DNA molecules into small fragments and reading each fragment separately) and assembling, i.e. the application of certain algorithms to reconstruct the genome from its fragments. The longer and the more accurate the fragments obtained through sequencing are, the higher the efficiency of these algorithms becomes.
For nearly 20 years, scientists around the world have been trying to improve both stages. Today, a number of companies develop and improve sequencing technologies. One such technology is TruSeq Synthetic Long Reads (TSLR), developed by Illumina, a recognised leader in this field. Its distinctive feature is that it splits the assembly process into two stages. This gives algorithm developers the opportunity to work with the interim information (shorter fragments, the so-called reads), analyse it, and then restore longer genome fragments. This interim stage, called barcode assembly, is the subject of the SPbU experts' study.
The scientists have analysed the properties of the TruSeq technology, revealed a number of deficiencies and created a new algorithm to compensate for them. One of the downsides of this technology is the formation of reads that join pieces of two different genome fragments. These are the so-called chimeric connections. "Such a problem is difficult to solve," emphasizes Anton Bankevich, a Junior Research Fellow at SPbU. "We have to find these connections and remove them. To do this, having compared the reads, we have to determine which ones are correct and which are not. Illumina experts did not know that such a problem could be extremely relevant to TSLR. We have proved that our new algorithm can cope with it."
To solve this problem, the SPbU scientists suggest using a standard method that has been employed in genome assembly since the early 1990s. Back then, bioinformaticians began to use a mathematical model called the de Bruijn graph. It is a versatile tool that helps to convert the information recovered from reads into a visual form. A de Bruijn graph built from the reads is the same as one built from the entire genome, provided the reads cover the genome and contain no errors. This tool helped the SPbU scientists to find incorrect connections, including chimeric ones, analyse their properties, and then delete them.
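The idea can be illustrated with a minimal Python sketch. It is not the TruSPAdes implementation: the toy genome, the read length, the value of k and the function names de_bruijn_edges and shred are assumptions made up for this example. The sketch shows that error-free reads covering a genome yield the same de Bruijn graph edges as the genome itself, and that a single chimeric read adds edges that do not belong to the genome.

```python
from collections import Counter

def de_bruijn_edges(sequences, k):
    """Count every k-mer in the sequences. Each k-mer is an edge
    from its (k-1)-letter prefix to its (k-1)-letter suffix."""
    edges = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges

def shred(genome, read_len):
    """Hypothetical helper: cut the genome into overlapping,
    error-free reads, one starting at every position."""
    return [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]

genome = "ATGGCGTGCAATCCGTAGGACTTA"  # invented toy sequence
k = 4
reads = shred(genome, read_len=8)

genome_graph = de_bruijn_edges([genome], k)
reads_graph = de_bruijn_edges(reads, k)

# The reads produce exactly the edges of the genome's graph.
same_edges = set(reads_graph) == set(genome_graph)

# A chimeric read glues together two distant pieces of the genome.
chimera = genome[0:5] + genome[15:20]
noisy_graph = de_bruijn_edges(reads + [chimera], k)

# In this toy setting the true genome is known, so the false
# edges can be listed directly; each is supported by one read only.
false_edges = set(noisy_graph) - set(genome_graph)
```

In a real project the genome is unknown, so false edges cannot be found by subtraction. The sketch only shows why weakly supported edges that span a junction are natural candidates for removal.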
The SPbU researchers had already faced the problem of chimeric connections while working on one of their first projects, the development of the SPAdes tool. In that project, chimeric reads were associated with the use of the MDA (multiple displacement amplification) technology, which allows sequencing based on a single cell. Until now, no one imagined that the TSLR technology could present the same problems.
The results obtained in the study will improve the effectiveness of the TSLR technology by 20%. The algorithm can be installed on the servers of various laboratories that work on genome assembly. It will process the data received via TSLR. As a result, scientists will get longer and more accurate genome fragments.
More on the results of the study: TruSPAdes: barcode assembly of TruSeq synthetic long reads, Nature Methods, 2016.
The TruSeq Synthetic Long Reads technology is part of the new wave of long-read sequencing technologies that began in 2011 with the appearance of SMRT (developed by Pacific Biosciences). TSLR makes it possible to obtain very accurate genome fragments at a much lower cost compared with competing technologies.
Today, TSLR is used in several large projects that involve, among others, SPbU experts. One of these projects is the study of metagenomes (collective genomes of microorganisms). This technology makes it possible to sequence a metagenome and get almost perfectly assembled genes, which was not possible with the technology of the previous generation. Metagenome sequencing of certain bacteria living in people will help to identify the role of these organisms in certain diseases.
The second project, which concerns the search for genome variations, also uses the new technology. It will make it possible to find complex variations that have remained unnoticed until recently. This will give scientists a better insight into the variability of the human genome and help to establish the true causes of many genetic diseases.