The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text

Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, Chloé Clavel

TL;DR

This work addresses the data-scarcity problem in language modeling by examining the long-term linguistic effects of training on text generated by prior models, using a recursive finetuning framework. It introduces three complementary linguistic-diversity metrics spanning lexical, semantic, and syntactic dimensions and tests them across three generation tasks with varying entropy levels. Across $n$ iterations, the study finds a consistent decline in lexical and especially syntactic diversity, while semantic diversity remains comparatively stable, highlighting potential risks to linguistic richness in long-term synthetic-data training. The results caution designers to balance performance with diversity preservation and motivate methods to maintain linguistic variety in future synthetic-data–driven training regimes.

Abstract

This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, especially remarkable for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models.

Paper Structure (33 sections, 5 figures, 3 tables)

This paper contains 33 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Our recursive tuning-generation process. Beginning with authentic, human-curated Data (0), Base (1) model undergoes finetuning to develop Model (1), which is the first model subject to our language diversity research. Subsequently, we use Model (1) to create synthetic Data (1) to train a successor Model (2) of the next generation, on the basis of Base (2) model. The process continues for n iterations. Base (1), Base (2), ..., Base (n) follow the same model architecture but are independently initialized instances.
  • Figure 2: Illustration of linguistic diversity variation for the story generation task under different recursion settings. Since there is a strong correlation between different diversity metrics of the same aspect, we only report one per aspect: Distinct-3 for lexical diversity and D_syn_c for syntactic diversity.
  • Figure 3: Histograms illustrating word frequency in texts produced across various iterations for the story generation task. For visual clarity, the x-axis, representing word frequency, is truncated at 100, though the actual distribution extends further. A noticeable trend is the diminishing presence of low-frequency, "unique" words in the synthetic text relative to human-generated text, a pattern that intensifies with each iteration. This trend highlights a progressive decline of lexical diversity in the generated text.
  • Figure 4: t-SNE visualization of dependency tree embeddings derived from sentences generated in successive iterations of our tuning-generation process. The visualization clearly depicts how, over time, the spatial distribution of the embeddings becomes increasingly compact. This decreasing spread is indicative of declining syntactic diversity.
  • Figure 5: t-SNE visualization of sentence embeddings from text generated across different iterations. There is a noticeable decrease in dispersion over iterations, indicating a reduction in semantic diversity, though this change is less pronounced compared to that of syntactic diversity.
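
The recursive tuning-generation loop in Figure 1 and the Distinct-3 score reported in Figure 2 can be illustrated with a small, self-contained sketch. It is not the authors' implementation. A bigram sampler stands in for the finetuned language model, tokens are split on whitespace, and Distinct-n is computed as the ratio of unique n-grams to total n-grams, which is the common definition and may differ in detail from the paper's. The names `finetune`, `generate`, and `recursive_loop` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def distinct_n(texts, n=3):
    """Unique n-grams divided by total n-grams over all texts (0.0 if none)."""
    ngrams = []
    for text in texts:
        tokens = text.split()  # assumption: whitespace tokenization
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def finetune(data):
    """Toy stand-in for finetuning a fresh base model: count bigram transitions."""
    model = defaultdict(Counter)
    for text in data:
        tokens = ["<s>"] + text.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model


def generate(model, rng, max_len=12):
    """Sample one text from the bigram counts, stopping at </s> or max_len."""
    tokens, prev = [], "<s>"
    for _ in range(max_len):
        choices = model[prev]
        nxt = rng.choices(list(choices), weights=list(choices.values()))[0]
        if nxt == "</s>":
            break
        tokens.append(nxt)
        prev = nxt
    return " ".join(tokens)


def recursive_loop(data0, n_iterations=5, seed=0):
    """Data (k-1) -> Model (k) -> Data (k), recording Distinct-3 of each Data (k)."""
    rng = random.Random(seed)
    data, scores = list(data0), []
    for _ in range(n_iterations):
        model = finetune(data)  # each iteration starts from a fresh "base" model
        data = [generate(model, rng) for _ in range(len(data0))]
        scores.append(distinct_n(data, n=3))
    return scores


if __name__ == "__main__":
    human_data = [
        "the old fox crossed the frozen river at dawn",
        "a young girl found a silver key under the bridge",
        "the river carried the key past the sleeping village",
        "at dawn the village woke to a strange silver light",
    ]
    print("Distinct-3 of human data:", distinct_n(human_data, n=3))
    print("Distinct-3 per iteration:", recursive_loop(human_data))
```

Because each iteration trains only on the previous iteration's samples, any word or transition that is not sampled once can never reappear. The sketch makes that mechanism easy to inspect, but its scores depend on the seed and the toy corpus and say nothing about the magnitudes reported in the paper.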