Rethinking the Role of Text Complexity in Language Model Pretraining
Dan John Velasco, Matthew Theodore Roque
TL;DR
This study finds that reducing surface-level complexity through LLM-based simplification largely preserves fine-tuned downstream performance across model sizes. Perplexity, by contrast, depends on the interaction between model capacity and text complexity, and zero-shot behavior varies by task type. By training from scratch on parallel original and simplified corpora, the study separates the effect of text complexity from that of content. In zero-shot settings, simpler text helps on linguistic-knowledge tasks, whereas more complex text helps on world-knowledge and entity-tracking tasks. The results suggest that different types of data diversity affect transfer and zero-shot performance differently, which informs data-curation strategies that prioritize knowledge coverage before introducing surface-level variation. Practically, these findings can guide targeted data design for specific objectives, subject to limitations such as imperfect simplification and limited scale.
Abstract
Improving pretraining data quality and size is known to boost downstream performance, but the role of text complexity (how hard a text is to read) remains less explored. We reduce surface-level complexity (shorter sentences, simpler words, simpler structure) while keeping core content approximately constant and ask: (i) How does complexity affect language modeling across model sizes? (ii) Can useful representations be learned from simpler text alone? (iii) How does pretraining text complexity influence downstream language understanding? We simplify human-written texts using a large language model, pretrain causal models (28M to 500M parameters) from scratch on original vs. simplified data, and evaluate them in fine-tuning and zero-shot setups. We find that perplexity is sensitive to the interaction between model capacity and text complexity: smaller models degrade far less on simpler texts. Text complexity has little impact on fine-tuning evaluations. Zero-shot evaluations, however, indicate that simpler texts benefit performance on linguistic knowledge tasks, whereas more complex texts favor tasks requiring world knowledge and entity tracking. Our findings suggest that different types of data diversity affect transfer and zero-shot performance differently, providing insight into tailoring data curation to specific goals.
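
To make "surface-level complexity" concrete, the following is a minimal sketch that compares two crude proxies, words per sentence and characters per word, on an original and a simplified passage. The function name `surface_stats`, the regex-based splitting, and both example sentences are illustrative assumptions; they are not the metrics or data used in the paper.

```python
import re


def surface_stats(text: str) -> dict:
    """Return crude surface-level complexity proxies for a passage.

    Assumption: these proxies are illustrative only, not the paper's measures.
    """
    # Rough sentence split on terminal punctuation (not a real sentence tokenizer).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return {"sentences": 0, "words_per_sentence": 0.0, "chars_per_word": 0.0}
    return {
        "sentences": len(sentences),
        "words_per_sentence": len(words) / len(sentences),
        "chars_per_word": sum(len(w) for w in words) / len(words),
    }


# Hypothetical parallel pair: same core content, different surface form.
original = (
    "The committee, having deliberated extensively, "
    "ultimately postponed the announcement."
)
simplified = "The committee talked for a long time. In the end, it delayed the news."

original_stats = surface_stats(original)
simplified_stats = surface_stats(simplified)
print("original:  ", original_stats)
print("simplified:", simplified_stats)
```

On this pair, the simplified version has shorter sentences and shorter words while conveying roughly the same content, which is the kind of contrast a parallel original/simplified corpus is meant to isolate.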
