Computer-reconstructed proto-languages


Scientists create automated ‘time machine’ to reconstruct ancient languages

By Yasmin Anwar, Media Relations | February 11, 2013

Ancient languages hold a treasure trove of information about the culture, politics and commerce of millennia past. Yet, reconstructing them to reveal clues into human history can require decades of painstaking work. Now, scientists at the University of California, Berkeley, have created an automated “time machine,” of sorts, that will greatly accelerate and improve the process of reconstructing hundreds of ancestral languages.

In a compelling example of how “big data” and machine learning are beginning to make a significant impact on all facets of knowledge, researchers from UC Berkeley and the University of British Columbia have created a computer program that can rapidly reconstruct “proto-languages” – the linguistic ancestors from which all modern languages have evolved. These earliest-known languages include Proto-Indo-European, Proto-Afroasiatic and, in this case, Proto-Austronesian, which gave rise to languages spoken in Southeast Asia, parts of continental Asia, Australasia and the Pacific.

And, of course, Proto-Semitic.

“What excites me about this system is that it takes so many of the great ideas that linguists have had about historical reconstruction, and it automates them at a new scale: more data, more words, more languages, but less time,” said Dan Klein, an associate professor of computer science at UC Berkeley and co-author of the paper published online today (Feb. 11) in the journal Proceedings of the National Academy of Sciences.

The research team’s computational model uses probabilistic reasoning – which explores logic and statistics to predict an outcome – to reconstruct more than 600 Proto-Austronesian languages from an existing database of more than 140,000 words, replicating with 85 percent accuracy what linguists had done manually. While manual reconstruction is a meticulous process that can take years, this system can perform a large-scale reconstruction in a matter of days or even hours, researchers said.

Not only will this program speed up the ability of linguists to rebuild the world’s proto-languages on a large scale, boosting our understanding of ancient civilizations based on their vocabularies, but it can also provide clues to how languages might change years from now.

“Our statistical model can be used to answer scientific questions about languages over time, not only to make inferences about the past, but also to extrapolate how language might change in the future,” said Tom Griffiths, associate professor of psychology, director of UC Berkeley’s Computational Cognitive Science Lab and another co-author of the paper.


Despite the try-too-hard time-machine metaphor (no flux capacitor is involved), this sounds like a useful development. It is not a panacea for the reconstruction of proto-languages, nor does it pretend to be, but anything that speeds up the slow and painstaking process of gathering, organizing, and categorizing the raw data is worthwhile and commendable. The claims here (apart from the time-travel stuff) are relatively measured compared with the reports of a few years ago about a computer program that could decipher Ugaritic.

The idea that this program could predict language changes should be handled cautiously. It may well be able to indicate a range of possible developments (conditional predictions), but any attempt to make precise extrapolations (unconditional predictions) about how a language will change in the future will quickly run up against far too many sensitive variables to compute (the butterfly effect), many of them involving intangibles about human society. Karl Popper described this problem before chaos theory was developed, and he also pointed out that the growth of scientific knowledge is itself unpredictable and therefore injects unpredictable variables into any attempt to predict the future.