The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text
Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, Chloé Clavel
TL;DR
This work addresses the data-scarcity problem in language modeling by examining the long-term linguistic effects of training on text generated by prior models, using a recursive finetuning framework. It introduces three complementary linguistic-diversity metrics spanning lexical, semantic, and syntactic dimensions and tests them across three generation tasks with varying entropy levels. Across successive finetuning iterations, the study finds a consistent decline in lexical and especially syntactic diversity, while semantic diversity remains comparatively stable, highlighting potential risks to linguistic richness in long-term synthetic-data training. The results caution designers to balance performance with diversity preservation and motivate methods to maintain linguistic variety in future synthetic-data–driven training regimes.
Abstract
This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, especially remarkable for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models.
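To make the idea of measuring lexical diversity across iterations concrete, the following is a minimal sketch of one widely used measure, the distinct-n ratio (unique n-grams divided by total n-grams). It is an illustrative stand-in and not necessarily one of the metrics used in the study. The function names, the whitespace tokenization, and the two toy "generations" of outputs are all assumptions made up for this example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(texts: Iterable[str], n: int = 1) -> float:
    """Ratio of unique n-grams to total n-grams over a set of texts.

    Lowercasing and whitespace tokenization are simplifying assumptions.
    Returns 0.0 when no n-grams can be formed.
    """
    counts: Counter = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


# Hypothetical outputs from two successive model generations (toy data,
# invented for illustration; not taken from the study).
generation_0 = ["the cat sat on the mat", "a dog ran across the yard"]
generation_1 = ["the cat sat on the mat", "the cat sat on the rug"]

for name, outputs in [("gen 0", generation_0), ("gen 1", generation_1)]:
    print(name, round(distinct_n(outputs, 1), 3), round(distinct_n(outputs, 2), 3))
```

On this toy data the second generation reuses more words and word pairs, so both its unigram and bigram ratios come out lower than the first generation's. Tracking such a score over each round of recursive finetuning would be one simple way to observe a decline of the kind the abstract describes. Syntactic and semantic diversity would need different tools, such as parsers or sentence embeddings.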
