Rethinking the Role of Text Complexity in Language Model Pretraining

Dan John Velasco, Matthew Theodore Roque

TL;DR

This paper shows that reducing surface-level complexity (via LLM-based simplification) preserves overall downstream performance across model sizes, while affecting perplexity and zero-shot behavior in ways that depend on task type. By training from scratch on parallel original and simplified corpora, the study isolates the impact of text complexity from content. It finds that simpler text helps linguistic-knowledge tasks in zero-shot settings, whereas more complex text helps world-knowledge and entity-tracking tasks. The results indicate that different kinds of data diversity affect transfer and zero-shot performance differently, which informs data-curation strategies that prioritize knowledge coverage before introducing surface-level variation. Practically, these findings guide targeted data design for specific objectives, subject to limitations such as imperfect simplification and limited model scale.

Abstract

Improving pretraining data quality and size is known to boost downstream performance, but the role of text complexity--how hard a text is to read--remains less explored. We reduce surface-level complexity (shorter sentences, simpler words, simpler structure) while keeping core content approximately constant and ask: (i) How does complexity affect language modeling across model sizes? (ii) Can useful representations be learned from simpler text alone? (iii) How does pretraining text complexity influence downstream language understanding? We simplify human-written texts using a large language model, pretrain causal models (28M-500M) from scratch on original vs. simplified data, and evaluate them in fine-tuning and zero-shot setups. We find that perplexity is sensitive to the interaction between model capacity and text complexity--smaller models degrade far less on simpler texts--while text complexity has little impact on fine-tuning evaluations, with zero-shot evaluations indicating that simpler texts benefit performance on linguistic knowledge tasks, whereas more complex texts favor tasks requiring world knowledge and entity tracking. Our findings suggest that different types of data diversity affect transfer and zero-shot performance differently, providing insight into tailoring data curation to specific goals.

Paper Structure

This paper contains 30 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: (Top) Perplexity (PPL) degrades faster for models trained on fwedu_hw (human-written) than on fwedu_simp (simplified) as model size decreases, suggesting that smaller models handle lower-complexity text more effectively. (Bottom) Average performance across 7 language tasks remains similar across data setups, suggesting that text complexity has limited impact on general language understanding.
  • Figure 2: Distribution of corpus features. The first row shows metrics of fwedu_simp relative to fwedu_hw. The second row shows pairwise metrics, except for Flesch Reading Ease (FRE), which requires only one input. The first row suggests that fwedu_simp is shorter, has more sentences, and uses simpler structures and more common words. The second row shows that fwedu_simp is semantically similar to fwedu_hw, with low word-order overlap (low ROUGE-2), moderate preservation of idea flow and structure (moderate ROUGE-L), and clearly higher FRE, indicating systematic differences in readability. For visualization, we removed outliers, which account for only 2.9% of the data (see Appendix \ref{sec:ratio-outliers} for the definition and examples of outliers).
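
The Flesch Reading Ease score mentioned in the Figure 2 caption can be illustrated with a minimal Python sketch. It uses the standard FRE formula, 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word). The syllable counter is a rough vowel-group heuristic and the function names are hypothetical, chosen for illustration. This is not the authors' implementation, and a dedicated readability tool may give different values.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables by counting vowel groups (a heuristic assumption)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent unless the word ends in 'le' (e.g. "simple").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Standard FRE formula; higher scores indicate easier-to-read text."""
    # Naive sentence split on terminal punctuation (assumption: no abbreviations).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    simple = "The cat sat on the mat. It was warm."
    dense = ("Notwithstanding considerable methodological heterogeneity, the "
             "investigators nevertheless extrapolated generalizable conclusions.")
    # The short-sentence, short-word text should score higher (easier to read).
    print(flesch_reading_ease(simple) > flesch_reading_ease(dense))
```

Because FRE depends only on sentence length and syllable counts, it needs a single text as input, unlike the pairwise metrics (such as ROUGE) in the same row of the figure.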