Scientists at St Petersburg University and the Institute for Intelligent Information Processing of ORT Braude College, Israel, have proposed a new computational approach to studying text authorship and style. The approach is based on modelling the dynamic process of text creation.

This approach enabled the scientists to analyse the works of J. R. R. Tolkien, Isaac Asimov, Arthur C. Clarke and many other renowned authors by exploring how each author's individual style changed over the years. The findings of the research group's recent study have been published in the journal Pattern Recognition (Elsevier).

The paper was written by SPbU postdoc Konstantin Amelin, Candidate of Science in Physics and Mathematics; SPbU Professor Oleg Granichin; Natalya Kizhaeva, a postgraduate (aspirantura) student at the SPbU Department of System Programming; and Zeev Volkovich, PhD, Head of the Institute for Intelligent Information Processing and Dean of the Computer Faculty at ORT Braude College, Israel.

The mathematicians selected several well-known works of literature: Foundation, a cycle of seven science fiction novels by Isaac Asimov; The Forsyte Saga, a series of works by John Galsworthy; The Lord of the Rings, a novel in three volumes by J. R. R. Tolkien; and other books. In previous papers they had already analysed J. K. Rowling's Harry Potter series. The researchers are interested in large bodies of text that an author has produced over time: the mathematical approach makes it possible to see how the author's individual style has changed.

Big data can be handled with traditional methods, i.e. classification and the search for related elements, similarities or groups. We introduced a new algorithm for big data analysis and proposed studying how the data itself was created.

SPbU Professor Oleg Granichin, Doctor of Science in Physics and Mathematics

"Any text was written, spoken or otherwise recorded by someone. This process has its own particular characteristics, which manifest themselves, for example, in the author's individual style. Today, we do not just study what data looks like, but reveal the characteristics of the process that created it. So far no one has analysed texts this way," Oleg Granichin noted.

In their paper the researchers compared the three books of The Lord of the Rings by J. R. R. Tolkien with his other works, namely The Hobbit and The Silmarillion. The method determined quite accurately that The Hobbit was written by the same author as the trilogy, whereas The Silmarillion differs greatly in style. This is because the book was published after the author's death: the collection of myths and legends of Middle-earth was completed by Christopher Tolkien, J. R. R. Tolkien's son, who had spent several years studying his father's drafts.

"There are notable differences in style even within the works of a single author," Natalya Kizhaeva adds. "For instance, Isaac Asimov wrote the fourth book of the Foundation cycle almost 30 years after completing the third, at his fans' insistence. Our method allowed us to divide the seven books of the series into two clusters: those created before 1953 and those written after 1982. Over those 30 years the author changed, as did his environment, his outlook on life and, as a consequence, his style of writing."

Staff of the SPbU Research Laboratory for Analysis and Modelling of Social Processes are also working on other projects that span the humanities and the exact sciences. In July 2016, using a unique manuscript analysis technology, they showed that the manuscript of Al-Khitat ("Description of Egypt") kept at the University of Michigan was very likely an original work by the famous Egyptian historian Al-Maqrizi. Until then, it had been considered a copy.

The method for modelling the dynamic process of text creation presented in the paper takes as its basic data not only sequences of individual symbols in the text and in words but also n-grams (contiguous strings of n symbols). For example, with n = 3, the six-symbol string "_mama_" is represented by the trigrams "_ma", "mam", "ama" and "ma_". The document is then divided into an ordered sequence of sub-documents, each described by its n-gram occurrences, and a relation is sought between each sub-document and its "neighbours". This step uses methods developed earlier in signal processing theory, which pick out frequency characteristics in data sequences. In this way the new method determines the individual "frequency characteristics" of an author's style, by analogy with the frequencies of physical waves recorded by special-purpose devices.
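
The n-gram and sub-document steps can be illustrated with a short Python sketch. It is not the authors' implementation: the function names, the fixed chunk size and the use of cosine distance between neighbouring n-gram profiles are all assumptions made for illustration, and the signal-processing analysis described in the paper is not reproduced here.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def split_into_chunks(text, chunk_size):
    """Split `text` into consecutive sub-documents of `chunk_size` symbols.

    A trailing piece shorter than `chunk_size` is dropped (an assumption).
    """
    return [text[i:i + chunk_size]
            for i in range(0, len(text) - chunk_size + 1, chunk_size)]


def ngram_profile(text, n=3):
    """Map each n-gram to its relative frequency in `text`."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def cosine_distance(p, q):
    """Cosine distance between two sparse frequency profiles.

    Chosen here only as a simple stand-in for the paper's similarity measure.
    """
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


def neighbour_distances(text, chunk_size, n=3, lag=1):
    """Distance between each sub-document and the one `lag` steps before it."""
    profiles = [ngram_profile(chunk, n)
                for chunk in split_into_chunks(text, chunk_size)]
    return [cosine_distance(profiles[i], profiles[i - lag])
            for i in range(lag, len(profiles))]


print(char_ngrams("_mama_", 3))  # ['_ma', 'mam', 'ama', 'ma_']
```

Under these assumptions, a run of small values from `neighbour_distances` would suggest a stable style across adjacent sub-documents, while a jump would mark a place where the n-gram profile shifts.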

The authors of the algorithm plan to test the methodology on works of Russian literature, since it can be applied to texts in other languages written in the Latin alphabet, the Cyrillic alphabet or Arabic script.

The researchers note that their method can help analyse not only literary works but also unstructured texts. For example, it will be useful for processing the arrays of data arriving at operator consoles or customer service call centres. Their Israeli colleagues use it to detect artificially generated texts, written not by a person but by a machine. For instance, there are programs that fabricate texts resembling real scientific papers, and such texts are sometimes accepted for publication in well-known journals. The method makes it possible to distinguish such articles from human-written texts with greater accuracy.