Bioinformaticians from St Petersburg University help to discover 130,000 new viruses in the public genomic databases

An international collaboration of scientists from St Petersburg University (the Center for Algorithmic Biotechnology); the Institut Pasteur (France); the University of British Columbia (Canada); the University of California, Berkeley (USA); the Heidelberg Institute for Theoretical Studies (Germany); and other researchers across the world has led to the discovery of over 130,000 previously unknown viruses in public genomic databases.
According to the scientists, there are trillions of viruses hitherto unknown to science. Many of them can be deadly and have major pandemic potential. Not all of them, however, are so dangerous.
The research paper, published in the journal Nature, lays the groundwork for so-called petabase-scale genomics. A petabase is one thousand trillion base pairs of DNA sequence; hence, petabase-scale genomics operates on a previously unimaginable amount of DNA and RNA data. The study analysed 16 petabases of sequencing data.
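To give a sense of the scale, the unit arithmetic can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the human genome length of roughly 3.1 billion base pairs is an approximate comparison value that does not come from the study.

```python
# Illustrative unit arithmetic for petabase-scale data (not part of Serratus).

PETABASE = 10**15  # one thousand trillion base pairs, as defined above


def petabases_to_bases(petabases: int) -> int:
    """Convert a count of petabases into individual base pairs."""
    return petabases * PETABASE


# The study reports analysing 16 petabases of sequencing data.
total_bases = petabases_to_bases(16)

# Assumption for comparison only: a human genome is roughly 3.1 billion bases.
APPROX_HUMAN_GENOME_BASES = 3_100_000_000
human_genome_equivalents = total_bases / APPROX_HUMAN_GENOME_BASES

print(f"{total_bases:.1e} bases in total")
print(f"about {human_genome_equivalents / 1e6:.1f} million human-genome lengths")
```

Under that assumption, 16 petabases corresponds to the length of several million human genomes.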
‘In order to cope with such volumes of information, we developed a cloud computing architecture, Serratus, tailored for ultra-high throughput sequence alignment at the petabase scale,’ said Anton Korobeynikov, a participant in the international project and Leading Research Fellow at the Center for Algorithmic Biotechnology, St Petersburg University. ‘Nonetheless, it would have been much more difficult to “re-assemble” genomic viral data without coronaSPAdes, an assembler for decoding RNA viral genomes. I developed it together with my colleague Dmitry Meleshko at the Center for Algorithmic Biotechnology.’
One of the main objectives of the Serratus Project was to create a hyper-optimised, pipelined computational architecture for processing petabases of RNA-sequencing datasets, reducing the volume of data from petabytes to gigabytes that can be processed relatively quickly on conventional computing resources. The coronaSPAdes assembler is our contribution to the unique Serratus architecture. Everyone did what they do best, and the result was a success.
Dmitry Meleshko, Junior Research Fellow at the Center for Algorithmic Biotechnology, St Petersburg University
Although it was not always possible to recover a full-length viral genome, even partial sequences made it possible to build phylogenetic trees that show how different viruses are related to each other and how they evolve.
The Serratus architecture is cost-optimised for big data analysis. This collaborative platform can analyse 1,000,000+ sequencing samples per day for under one cent per sample. Ultra-high throughput alignment at this efficiency – isn’t this a dream of all geneticists?!
Anton Korobeynikov, Leading Research Fellow at the Center for Algorithmic Biotechnology, St Petersburg University
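The throughput and cost figures quoted above imply some simple bounds, sketched here in Python. This is back-of-the-envelope arithmetic rather than a description of how Serratus accounts for its costs; the function names are hypothetical, and treating 'one cent' as a hundredth of a US dollar is an assumption.

```python
# Back-of-the-envelope bounds from the quoted figures (illustrative only).

SAMPLES_PER_DAY = 1_000_000      # quoted as "1,000,000+", so a lower bound
MAX_CENTS_PER_SAMPLE = 1         # quoted as "under one cent", so an upper bound
SECONDS_PER_DAY = 24 * 60 * 60


def daily_cost_upper_bound(samples: int, cents_per_sample: int) -> float:
    """Upper bound on daily cost in currency units (assumed: US dollars)."""
    return samples * cents_per_sample / 100


def samples_per_second(samples_per_day: int) -> float:
    """Average sustained throughput implied by a daily sample count."""
    return samples_per_day / SECONDS_PER_DAY


cost = daily_cost_upper_bound(SAMPLES_PER_DAY, MAX_CENTS_PER_SAMPLE)
rate = samples_per_second(SAMPLES_PER_DAY)

print(f"under {cost:,.0f} per day for {SAMPLES_PER_DAY:,} samples")
print(f"at least {rate:.1f} samples per second on average")
```

In other words, a million samples a day at under one cent each would cost less than 10,000 per day and corresponds to a sustained average of more than 11 samples per second.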
The study found over 250 giant bacteriophages, that is, viruses that infect bacteria. They are similar to viruses that infect algae. Close relatives of these ‘huge phages’ have been identified, for example, in a human from Bangladesh and in groups of cats and dogs from Great Britain.
Researchers with the Serratus Project found 132,000 RNA viruses, whereas only 13,500 were previously known. Thus, by re-analysing all public RNA sequencing data, the Serratus platform uncovered almost ten times more RNA viruses than had been known before.
‘The public repository of open-source software and existing sequencing data has unlocked the door to many new discoveries, especially considering that the corpus of cloud-based public DNA and RNA sequencing datasets is growing exponentially. We expect to be able to identify over 100 million RNA viruses by the end of this decade,’ said Dmitry Meleshko.