A collaborative study by researchers at Fudan, Harvard, and Stony Brook University has used artificial intelligence and natural language processing to uncover a hidden statistical structure shared by 22 distinct human languages. By analyzing linguistic data spanning from the Middle Ages to the present day, the team found that word evolution follows predictable, hierarchical patterns reminiscent of biological evolution. The findings, published in Proceedings of the Royal Society B: Biological Sciences, suggest that the emergence of new concepts and vocabulary is governed by a stochastic mathematical process, providing a rigorous framework for understanding how human culture transforms over centuries.
LONG ISLAND, N.Y. — While human languages are often viewed as distinct products of isolated cultures, a groundbreaking study has revealed that they all share a singular mathematical heartbeat. Researchers leveraging machine learning to navigate the vast “semantic space” of 22 different languages have identified universal patterns that govern how words are born, clustered, and sustained over time. The study provides a significant leap forward in the field of quantitative linguistics, suggesting that the expansion of human knowledge follows a predictable, shared geometry.
Mapping the High-Dimensional Semantic Space
To decode the evolution of language, the research team—led by senior author Steven Skiena—moved beyond simple word counts. Instead, they utilized word embeddings, a core technology in Natural Language Processing (NLP). Word embeddings translate vocabulary into numerical vectors, placing every word within a high-dimensional “semantic space” (typically 300 dimensions). In this mathematical environment, words with similar meanings—such as “king” and “queen” or “computer” and “software”—are represented as points located near one another.
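To make the idea of "nearby" words concrete, the following minimal sketch measures closeness with cosine similarity, a common choice for comparing embeddings. The four-dimensional vectors are invented purely for illustration and are not taken from the study; real embeddings are learned from large text collections and have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, made up for this example only.
toy_embeddings = {
    "king":     [0.9, 0.8, 0.1, 0.0],
    "queen":    [0.8, 0.9, 0.2, 0.0],
    "computer": [0.1, 0.0, 0.9, 0.8],
    "software": [0.0, 0.1, 0.8, 0.9],
}

# Related words point in nearly the same direction; unrelated words do not.
sim_royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_cross = cosine_similarity(toy_embeddings["king"], toy_embeddings["computer"])
print(f"king vs queen:    {sim_royal:.3f}")
print(f"king vs computer: {sim_cross:.3f}")
```

With these toy numbers, the "king"/"queen" pair scores much higher than the "king"/"computer" pair, which is the sense in which similar words sit close together in the space.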
“In essence, our paper asks how the vocabulary of different languages are distributed in this feature space, and what kind of mathematical process would create a similar distribution,” Skiena stated. This approach allowed the team to compare the statistical structure of languages as diverse as English, Mandarin, and Russian, in a project that took seven years of research.
Universal Clustering and Taylor’s Law
The researchers uncovered several “universal facts” about human culture that appear to transcend linguistic boundaries. The most prominent of these is the tendency for high-frequency, “popular” words to cluster together in specific regions of the semantic space. This creates a dense core of common vocabulary surrounded by more specialized, lower-frequency terms.
Furthermore, the team discovered that language adheres to Taylor’s Law, a power-law relationship originally identified in ecology, in which the variance of a population count scales as a power of its mean. Applied to language, the law relates the mean and variance of word counts when words are grouped by semantic meaning and by historical appearance. The presence of Taylor’s Law suggests that the “population” of words in a language behaves much like a biological ecosystem, with new words emerging in rapid bursts. The researchers noted that this phenomenon is strikingly similar to the “punctuated equilibrium” seen in biological evolution.
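In its usual form, Taylor’s Law states that variance = a × mean^b, so the points fall on a straight line with slope b when plotted on log-log axes. The sketch below shows one common way to estimate a and b, using ordinary least squares on the logarithms. The function name `fit_taylor_law`, the grouping into bins, and the synthetic counts are all assumptions made for illustration; the article does not describe how the study binned its data or fitted the law.

```python
import math
from statistics import mean, variance

def fit_taylor_law(groups):
    """Fit variance = a * mean**b by least squares on log-log axes.

    `groups` is a list of bins, each a list of counts (hypothetical layout).
    Returns the pair (a, b).
    """
    xs = [math.log(mean(g)) for g in groups]
    ys = [math.log(variance(g)) for g in groups]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope of the log-log line
    a = math.exp(y_bar - b * x_bar)      # intercept, mapped back from log space
    return a, b

# Synthetic counts: each bin is the same pattern scaled up, so the mean
# grows with the scale and the variance with its square (exponent 2).
base = [1, 2, 3]
groups = [[scale * c for c in base] for scale in (1, 2, 4, 8)]

a, b = fit_taylor_law(groups)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Real word-count data would scatter around the fitted line rather than lie exactly on it, and the fitted exponent would be an empirical finding rather than something built in by construction as it is here.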
The Stochastic Model of Cultural Growth
One of the study’s most significant contributions is the development of a surprisingly simple mathematical model that replicates these complex historical processes. By combining two distinct mathematical concepts, the researchers were able to simulate how languages grow across both semantic dimensions and historical time:
- Cumulative-Advantage Process: Often referred to as “the rich get richer,” this principle explains why certain words become more entrenched and popular the more they are used.
- Von Mises–Fisher Distribution: A probability distribution on a sphere in high-dimensional space, used here to model the direction and spread of new concepts within the semantic map.
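The two ingredients above can be combined in a toy simulation. The sketch below is not the authors’ model: the growth rule, the parameter names (`kappa`, `p_new`), and their values are assumptions chosen for illustration. At each step an existing word is picked with probability proportional to its usage count (cumulative advantage); it then either gains one more use or, with probability `p_new`, spawns a new word whose direction is drawn from a von Mises–Fisher distribution centered on it. The sampler follows the standard rejection method of Wood (1994).

```python
import math
import random

def sample_vmf(mu, kappa, rng):
    """Draw one unit vector from a von Mises-Fisher distribution around unit vector mu."""
    p = len(mu)
    # Step 1: rejection-sample w, the component along the mean direction.
    b = (-2 * kappa + math.sqrt(4 * kappa**2 + (p - 1) ** 2)) / (p - 1)
    x0 = (1 - b) / (1 + b)
    c = kappa * x0 + (p - 1) * math.log(1 - x0**2)
    while True:
        z = rng.betavariate((p - 1) / 2, (p - 1) / 2)
        w = (1 - (1 + b) * z) / (1 - (1 - b) * z)
        u = 1.0 - rng.random()  # uniform in (0, 1], keeps log(u) finite
        if kappa * w + (p - 1) * math.log(1 - x0 * w) - c >= math.log(u):
            break
    # Step 2: uniform random direction orthogonal to the last axis.
    v = [rng.gauss(0, 1) for _ in range(p - 1)]
    v_norm = math.sqrt(sum(t * t for t in v))
    scale = math.sqrt(max(0.0, 1 - w * w)) / v_norm
    x = [scale * t for t in v] + [w]
    # Step 3: Householder reflection that maps the last axis onto mu.
    h = [-m for m in mu]
    h[-1] += 1.0
    h_sq = sum(t * t for t in h)
    if h_sq > 1e-12:  # skip when mu already is the last axis
        coef = 2 * sum(hi * xi for hi, xi in zip(h, x)) / h_sq
        x = [xi - coef * hi for xi, hi in zip(x, h)]
    return x

def grow_vocabulary(n_words, dim=3, kappa=50.0, p_new=0.1, seed=0):
    """Toy growth process (hypothetical): returns (unit vectors, usage counts)."""
    rng = random.Random(seed)
    vectors = [[0.0] * (dim - 1) + [1.0]]  # a single seed word
    counts = [1]
    while len(vectors) < n_words:
        # Cumulative advantage: frequently used words are picked more often.
        parent = rng.choices(range(len(vectors)), weights=counts)[0]
        if rng.random() < p_new:
            # A new word appears near its parent on the unit sphere.
            vectors.append(sample_vmf(vectors[parent], kappa, rng))
            counts.append(1)
        else:
            counts[parent] += 1
    return vectors, counts

vectors, counts = grow_vocabulary(200)
print("vocabulary size:", len(vectors))
print("largest usage count:", max(counts))
```

A larger `kappa` keeps new words closer to their parents, while a smaller one spreads them more widely over the sphere. Both parameters are placeholders here and are not fitted to any data.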
Sergiy Verstyuk, co-first author of the paper, noted that this model accounts for findings across 300 dimensions of meaning. It explains why new words tend not to appear in isolation but in “bursts” alongside other related concepts, reflecting societal shifts like the Industrial Revolution or the Digital Age.
Implications for Anthropology and AI
The study suggests that the underlying mechanics of language are not merely accidental artifacts of history but are driven by fundamental mathematical rules. The researchers believe that these patterns are likely not limited to language alone but may extend to other domains of human culture, such as art, technology, and social structures.
The use of AI as a tool for fundamental research rather than just a utility marks a shift in the study of the humanities. “We remain excited about the possibilities of using AI-generated embeddings as a tool for fundamental research in understanding historical processes in cultural evolution,” Skiena added. By viewing language through the lens of spatial statistics and machine learning, scholars can now quantify the invisible forces that have shaped human communication since the Middle Ages.