A new tool from the Center for Algorithmic Biotechnology at St Petersburg University, called coronaSPAdes, makes it possible to assemble the genomes of RNA viruses, primarily coronaviruses. According to preliminary results, it has already been used to assemble the genome sequences of previously unknown coronaviruses.
CoronaSPAdes is a special mode of the SPAdes assembler (Saint Petersburg Assembler), the internationally known flagship software product of the Center for Algorithmic Biotechnology at St Petersburg University. Scientists in many countries have used SPAdes to analyse the pathogens behind outbreaks of Middle East Respiratory Syndrome (MERS) in Saudi Arabia, Ebola virus disease in the Congo, gonorrhoea (Neisseria gonorrhoeae) in England, pneumococcal meningitis in Ghana, dengue fever in Sumatra, and dozens of other outbreaks.
The SPAdes assembler and its various operating modes make it possible to decipher the genomes of living organisms and of viruses. Biologists still cannot read a genome the way we read a book, from beginning to end. Instead, they 'read' small fragments, which then have to be pieced together into the full text. Genome assembly is therefore much like assembling a million-piece puzzle without the reference picture. It is one of the most complex algorithmic problems in bioinformatics, and solving it requires special tools called genome assemblers.
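To make the puzzle analogy concrete, here is a toy sketch in Python of piecing short fragments ('reads') back into a longer sequence. It is an illustration only: the function names `overlap` and `greedy_assemble` and the example reads are invented for this sketch, and the greedy overlap-merge heuristic shown is not the algorithm used by SPAdes or coronaSPAdes. It also assumes error-free reads from a single strand with no repeats, which real sequencing data never guarantees.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Overlaps shorter than `min_len` are treated as chance matches and
    reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap.

    Returns the sequences (contigs) left once no pair overlaps by at
    least `min_len` characters.
    """
    # Drop exact duplicates (keeping order) and reads contained in another read.
    contigs = list(dict.fromkeys(reads))
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]

    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        # Glue the two sequences together, writing the shared part only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Three overlapping fragments of the made-up sequence ATGGCGTACGTTAGC,
# given out of order, as a sequencer would produce them.
reads = ["TACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

Even this tiny example shows why the problem is hard at scale: every pair of fragments has to be compared, and a greedy choice that looks best locally can be wrong once the genome contains repeated regions or the reads contain errors.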
‘The requests of the scientific community inspired us to develop the coronaSPAdes mode,’ said Anton Korobeynikov, leading research fellow at the Center for Algorithmic Biotechnology at St Petersburg University and one of the main authors of the new product. ‘We received numerous questions about how best to assemble RNA viruses with the SPAdes family of tools. They came from various laboratories, including our collaborators at the European Bioinformatics Institute (EMBL-EBI), with whom we have a joint grant from the Russian Foundation for Basic Research, and a community of scientists in the Serratus research collaboration who are working to discover novel coronaviruses and other viruses in public data. The existing modes of the SPAdes assembler do not give a tangible advantage over competing solutions. The task, therefore, was to develop a new mode that takes into account the unique structural features of Coronaviridae genomes and of the sequencing data.’
The crucial role in this development belongs to Dmitrii Meleshko, a research fellow at the Center for Algorithmic Biotechnology at St Petersburg University. It is also important to note that coronaSPAdes builds on the laboratory's previous work and on the code base of other tools in the SPAdes assembler family: metaSPAdes, rnaSPAdes, metaviralSPAdes and biosyntheticSPAdes. Without these developments, creating this mode would have been impossible.
It took a couple of weeks to develop the first version of coronaSPAdes. Test data provided by the Serratus collaboration helped to complete the work in such a short time. The developers are still improving the assembler, but according to them it already assembles coronavirus genomes de novo considerably more effectively than alternative approaches. For example, according to preliminary results, full-length genomes of previously unknown coronaviruses have been assembled from several public data sets.
The coronaSPAdes mode deals specifically with the artefacts of RNA sequencing data and implements unique algorithmic solutions aimed at improving the assembly of Coronaviridae genome sequences. Moreover, the approaches introduced in coronaSPAdes could be used in the future to develop new assemblers that exploit information about the structure of other genomes.
‘The coronaSPAdes assembler immediately began to be widely used by scientists. However, it is difficult for us to assess the extent of its use because we do not track all users. CoronaSPAdes is open-source software and is available for download and use by anyone interested. According to our data, in addition to EMBL-EBI, such large research communities as Serratus, the MetaSUB Consortium and Nextflow have shown interest in the assembler,’ noted Anton Korobeynikov.
Alla Lapidus, Associate Director of the Center for Algorithmic Biotechnology at the Institute of Translational Biomedicine at St Petersburg University, said that the laboratory had developed several new programs in a short time. Their goal is to process, quickly and efficiently, the genomic data needed to analyse disease-causing viruses, primarily coronaviruses.
‘In 2020, the epidemiological situation in the world leaves no room for scientists and doctors to relax: no sooner have they managed to cope with coronavirus than reports have appeared of a possibly new strain of swine flu, called G4 EA H1N1,’ said Alla Lapidus. ‘Whether this strain is really new or a previously known seasonal strain can be established primarily through the analysis of its genome. And some days ago, cases of bubonic plague, caused by the bacterium Yersinia pestis, were reported in China. In such an unfavourable situation, the need increases not only for analytical methods but also for competent specialists. This year, the first master’s students in Bioinformatics in the history of St Petersburg University have graduated, and I wish our graduates great scientific developments and discoveries.’
The Center for Algorithmic Biotechnology was established at St Petersburg University at the end of 2014 as part of the St Petersburg University megagrant project for solving top priority computational problems of modern biomedicine. The flagship of the laboratory is the SPAdes genome assembler (Saint Petersburg Assembler). It is used by thousands of experts in genomics throughout the world.