Sobaka.ru: Meet Ivan Alexandrov, a St Petersburg University researcher, who helped scientists completely map the last chromosome in the human genome
In the spring of 2022, scientists presented an almost complete, gapless reading of the human genome. Yet one blank spot remained: the male Y chromosome, which had never been fully read. In December, an international consortium posted a preprint of an article that closed this gap. Ivan Alexandrov, a scientist at St Petersburg University, is a contributor to this work. In an interview with Sobaka.ru, he spoke about what this achievement would bring to medicine, research on male aging, the fight against cancer, and the study of evolution.
Could you please tell us about the mapping of the human genome? This is a big project that has been going on for over 30 years.
I would even say that these are several projects. People started talking about it in the late 1980s. There were big disputes about whether it should be done at all and whether it was possible. Many doubted that the undertaking would succeed and insisted that it was a waste of scientists' effort and resources.
As a result, a compromise path was taken. Scientists mapped not the entire genome but the parts of it that were the most interesting at the time. These were the segments of DNA that included genes, i.e. relatively long unique stretches of DNA. They make up about 90% of the genome. The findings were published in 2003. And pretty quickly all the disputes about whether it was necessary were forgotten, because the progress was tremendous!
And what was the point of that for humanity?
Knowledge about all the genes in human DNA and their surroundings. Based on this work, scientists began to build gigantic databases for further research. One of them is UK Biobank. It stores the data of hundreds of thousands of people.
DNA samples were taken from those people. In addition, they were asked in detail whether they had migraines, a tendency to high blood pressure, or dental problems, and whether they slept well at night. Now scientists all around the world are exploring these databases, trying to uncover patterns in how a change in a particular gene is associated with the complaints these people have.
Does all the news that scientists have found the "insomnia gene" or the "obesity gene" come from there?
Not always directly, but yes. Scientists could have some assumptions. For example, they might notice that people who did not sleep well at night somehow differed in one of the proteins in the body. However, the real, statistically verified evidence that a particular section of the genome is associated with insomnia is now based on these databases. And none of that would have been possible without the publication of most of the human genome in 2003.
Nor would the numerous evolutionary studies have been possible. This year, biologist Svante Pääbo won a Nobel Prize for them. He works with the genomes of Neanderthals, Denisovans and other ancient people.
But why then is there still news that the reading of the human genome is in progress?
As I have said, not the entire genome was mapped in the 2000s. There were "holes" in this reading (about 10%) that were impossible to map at the time.
And what are these areas?
These are the so-called tandem repeats. If you remember, there are four types of nucleotides in DNA. These are organic compounds that make up our genome. They are indicated by four letters on the diagrams. These letters can line up in long original sequences of hundreds and thousands of characters. This is how genes are arranged. In this form, they are repeated in our chromosomes very rarely or not at all.
Genes do not make up all of our DNA, but only part of it. The rest is filled with the repeats I mentioned above. These are short sequences a couple of hundred or several thousand nucleotide ’letters’ long. In the chromosome, they can be repeated several thousand times in a row. It is these that could not be read in the 2000s.
And what was the problem?
First, I have to say a few words about how DNA is read in general. After we have taken a sample, we extract DNA from the cells. During this process, it is torn into many parts, which then need to be read and reassembled in their original sequence.
I recall one bioinformatician telling me how this work proceeds. He said, 'Imagine a room where several copies of the complete works of Lenin were kept. Then there was an explosion in that room. After that, you try to re-compile the complete works by comparing fragments from different copies, to work out the sequence of letters on each page.'
Yes, that is a good metaphor. So, in the past, scientists could read such fragments up to a thousand nucleotide letters long. If you found two fragments in which 200 letters completely coincided, you understood that they came from the same stretch of the chromosome, only torn off at opposite ends. You could overlay them, ’glue’ them together, and look for the next fragment that overlapped the already assembled one by at least 200 characters. And again, and again. Like a puzzle. All this, of course, is not done manually. We use special software called assemblers.
When you have long non-repeating sequences, all this works well. But when some fragment is repeated several thousand times in a row, the assembler does not lay the copies out as one long chain; it simply treats them as identical pieces from the same place. The repeats collapse. It became possible to correctly read whole arrays of repeats only now, when we can read huge pieces of DNA consisting of tens and hundreds of thousands of letters at a time. As a result, this spring, an international group of scientists managed to publish a complete reading of all chromosomes of the haploid genome with an X chromosome, and now we have presented the last piece, the Y chromosome. By the way, my colleagues at St Petersburg University played a significant part in this work. They were developing new assemblers for checking the quality of the assembled sequence of repeats that made this work possible.
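The overlap-and-glue procedure described in this answer, and the way repeats collapse under it, can be sketched in a few lines of Python. This is only a toy illustration, not the software used by the consortium: the function names (`overlap`, `greedy_assemble`, `reads_of`), the made-up 52-letter genome and the read lengths of 8 and 46 letters are all assumptions chosen to keep the example small, and real assemblers are far more sophisticated.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Longest suffix of `a` that equals a prefix of `b`; 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int) -> list[str]:
    """Toy greedy assembler: keep gluing the two pieces with the largest overlap."""
    # Identical reads are indistinguishable, so they are treated as one piece.
    contigs = list(dict.fromkeys(reads))
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left that overlaps enough
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


def reads_of(genome: str, length: int) -> list[str]:
    """Every window of `length` letters, standing in for sequenced fragments."""
    return [genome[p:p + length] for p in range(len(genome) - length + 1)]


# Made-up toy genome: unique flanks around a 3-letter unit repeated 12 times.
genome = "TCAGTAGG" + "CAT" * 12 + "GGTTACGA"

# Short reads (8 letters) are much shorter than the 36-letter repeat array.
short_contigs = greedy_assemble(reads_of(genome, 8), min_len=6)
# Long reads (46 letters) span the whole array plus some unique flank.
long_contigs = greedy_assemble(reads_of(genome, 46), min_len=6)

print(len(genome), [len(c) for c in short_contigs], [len(c) for c in long_contigs])
```

With the 8-letter reads, every fragment taken from inside the repeat array looks like one of just three strings, so no assembled piece reaches the full 52 letters: the array collapses. With the 46-letter reads, each fragment spans the whole array and is anchored in the unique flanks, and the single assembled piece is identical to the original genome.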
And why read these fragments with repeats at all?
To better understand how our genome works. The male Y chromosome we have just managed to read is one such difficult case. Two-thirds of it consists of repeats! That is, until now, we did not understand what a large part of the chromosome responsible for a person's sex looked like. Much was known before, but only now are we able to see the full picture.
In addition, it is these fragments that are important for understanding evolution. The fact is that mutations in genes occur quite rarely. Significant changes may not be observed for hundreds of thousands of years. In repeats, however, they occur relatively often. This is the most rapidly evolving part of DNA, so repeats are much richer in information in this sense. Evolutionary researchers now stand to reap a good harvest in this field. It will allow them to better understand how humans have changed over the past tens and hundreds of thousands of years.
Why else is this research important?
As I have already said, the Y chromosome determines the sex of a human being. Roughly speaking, a man is a human who has a Y chromosome. Knowing how it works is important for understanding the sex determination process. Many other species differ from humans in this sense.
Secondly, there is one more important thing. It has long been known that Y chromosomes are sometimes lost in individual cells in old age. It occurs in about one in five old men. Most often, it is blood cells that lack the Y chromosome. Moreover, those very huge databases of genetic data that we talked about at the beginning demonstrated that the loss of Y chromosomes was associated with a shorter life expectancy.
In addition, Y chromosomes are often not found in male tumour cells. It is possible that their loss is associated with the likelihood and severity of cancer. But, as of today, that is all speculation.
And in order to confirm it or refute it, a completely decoded Y chromosome was required, wasn’t it?
Naturally. The fact is that the Y chromosome differs greatly between human populations, for example, between ethnic groups. A question therefore arises. What if some evolutionary events in this part of the genome are exactly what causes some people to lose the Y chromosome in old age? After all, this happens in only 20% of men, not in 100%. It still needs to be tested, but this is the first hypothesis that comes to mind.
Could you please tell us about your role in the research?
I was responsible for the analysis of centromeres. In fact, I have been doing that since the very start of my scientific career. What is a centromere? The chromosome is often drawn in the form of a bow. The centromere is the knot in the middle of this bow. It is the place where special microtubules attach during cell division. These tubules pull the doubled chromosomes into two new cells. If the tubules are somehow poorly attached, an extra chromosome can get into one of the cells. This is how Down syndrome arises.
Actually, these centromeres consist of repeats. Therefore, my colleagues and I worked out how to assemble them correctly and then examined the result. By the way, anyone can now see our research findings on a special website. There, the chromosomes we have read are fully visualised, and different repeats are marked with different colours. You can see the big picture, and you can zoom in on a section or even look at individual nucleotide sequences.
Is your work finished here?
No, it is not. The fact is that so far we have solved only a simplified problem. We have collected a haploid set of chromosomes. That is, one copy of each chromosome. For that, a special cell line was used. However, in living people, each chromosome has two copies, and they can slightly differ from each other. Our next task is therefore to assemble such a complete genome, with two versions of each chromosome. When we learn how to do such an assembly, the geneticists will be given a free hand.
Moreover, I work with a related consortium that has taken on the task of cataloguing the diversity of human genomes. They are planning to map not a single genome, but to take samples from 350 people belonging to different ethnic groups and make complete assemblies of these genomes, which can then be compared. We need some 10 years to complete this task, if we mean not only to compare, but also to digest the results of this comparison, to draw conclusions.
And why is this needed?
This is necessary because people, as it turns out, are very different (laughs). An average genome does not explain the full variety of biochemical and biophysical phenomena that occur in our body. Why do representatives of different ethnic groups differ, for example, in the results of medical tests? All that needs to be studied. I can already say that the centromeres of people of different origins are very different. It is quite possible that this has certain medical consequences.