ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining
Melis Ilayda Bal, Volkan Cevher, Michael Muehlebach
TL;DR
ESLM addresses the inefficiency of treating all tokens uniformly in large-scale pretraining by introducing token-level, risk-aware selective learning. It scores each token with a per-token risk $S_\theta(x_j)$ and filters tokens with a value-at-risk threshold, which reshapes the training distribution to emphasize informative tokens. This reduces training compute while maintaining, and often improving, perplexity and downstream results. The framework connects to distributionally robust optimization and includes adaptive (Ada-ESLM) and knowledge-distillation (ESLM-KD) extensions, with consistent gains across model sizes and data mixtures. Empirically, ESLM reduces training FLOPs and enables larger batch sizes while maintaining or improving performance on a broad set of benchmarks, which makes it practical for scalable and robust LLM pretraining.
Abstract
Large language model pretraining is compute-intensive, yet many tokens contribute marginally to learning, resulting in inefficiency. We introduce Efficient Selective Language Modeling (ESLM), a risk-aware algorithm that improves training efficiency and distributional robustness by performing online token-level batch selection. ESLM leverages per-token statistics (e.g., entropy or loss) and applies value-at-risk thresholding to retain only the most informative tokens per batch. This data-centric mechanism reshapes the training loss, prioritizing high-risk tokens and eliminating redundant gradient computation. We frame ESLM as a bilevel game: the model competes with a masking adversary that selects worst-case token subsets under a constrained thresholding rule. In the loss-based setting, ESLM recovers conditional value-at-risk loss minimization, providing a principled connection to distributionally robust optimization. We extend our approach to Ada-ESLM, which adaptively tunes the selection confidence during training. Experiments on GPT-2 pretraining show that ESLM significantly reduces training FLOPs while maintaining or improving both perplexity and downstream performance compared to baselines. Our approach also scales across model sizes and pretraining corpora, and integrates naturally with knowledge distillation.
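The selection step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`token_entropy`, `var_threshold_mask`, `selective_loss`) are hypothetical, and the convention that `alpha` is the quantile level below which tokens are dropped is an assumption made here for illustration. The sketch scores tokens, takes the empirical `alpha`-quantile of the scores as the value-at-risk threshold, and averages the loss over the retained tokens only. When the score is the loss itself, this average is the empirical mean of the upper tail of the loss distribution, which is the quantity that conditional value-at-risk measures.

```python
import numpy as np


def token_entropy(logits):
    """Per-token predictive entropy from logits of shape (..., vocab)."""
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize the softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)


def var_threshold_mask(scores, alpha):
    """Keep tokens whose score is at or above the empirical alpha-quantile.

    Assumption: alpha in [0, 1) is the quantile level, so roughly a
    (1 - alpha) fraction of the tokens in the batch is retained.
    """
    scores = np.asarray(scores, dtype=float)
    threshold = np.quantile(scores, alpha)  # empirical value-at-risk
    return scores >= threshold, threshold


def selective_loss(losses, scores, alpha):
    """Mean loss over the retained (high-risk) tokens only."""
    losses = np.asarray(losses, dtype=float)
    mask, _ = var_threshold_mask(scores, alpha)
    return losses[mask].mean()


# Toy batch: ten per-token losses, scored by the loss itself.
token_losses = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 2.0, 3.0])
mask, tau = var_threshold_mask(token_losses, alpha=0.8)
tail_mean = selective_loss(token_losses, token_losses, alpha=0.8)

print("kept tokens:", int(mask.sum()), "of", token_losses.size)
print("threshold:", tau)
print("mean loss over kept tokens:", tail_mean)
print("mean loss over all tokens:", token_losses.mean())
```

In an actual training loop the mask would be applied to the per-token losses before backpropagation, so that only the retained tokens contribute gradients. Passing `token_entropy(logits)` as `scores` instead of the losses gives the entropy-based variant of the same thresholding rule.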
