ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining

Melis Ilayda Bal, Volkan Cevher, Michael Muehlebach

TL;DR

ESLM addresses the inefficiency of treating all tokens uniformly in large-scale pretraining by introducing token-level, risk-aware selective learning. It scores each token with a per-token risk $S_\theta(x_j)$ and filters tokens with a value-at-risk threshold, which reshapes the training distribution to emphasize informative tokens. This improves compute efficiency while maintaining, and often improving, perplexity and downstream results. The framework connects to distributionally robust optimization and includes an adaptive extension (Ada-ESLM) and a knowledge-distillation extension (ESLM-KD), with consistent gains across model sizes and data mixtures. Empirically, ESLM reduces training FLOPs and enables larger batch sizes while maintaining or improving performance on a broad set of benchmarks.

Abstract

Large language model pretraining is compute-intensive, yet many tokens contribute marginally to learning, resulting in inefficiency. We introduce Efficient Selective Language Modeling (ESLM), a risk-aware algorithm that improves training efficiency and distributional robustness by performing online token-level batch selection. ESLM leverages per-token statistics (e.g., entropy or loss) and applies value-at-risk thresholding to retain only the most informative tokens per batch. This data-centric mechanism reshapes the training loss, prioritizing high-risk tokens and eliminating redundant gradient computation. We frame ESLM as a bilevel game: the model competes with a masking adversary that selects worst-case token subsets under a constrained thresholding rule. In the loss-based setting, ESLM recovers conditional value-at-risk loss minimization, providing a principled connection to distributionally robust optimization. We extend our approach to Ada-ESLM, which adaptively tunes the selection confidence during training. Experiments on GPT-2 pretraining show that ESLM significantly reduces training FLOPs while maintaining or improving both perplexity and downstream performance compared to baselines. Our approach also scales across model sizes and pretraining corpora, and integrates naturally with knowledge distillation.
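The loss-based selection described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `var_token_mask` and `selective_loss` are hypothetical, the empirical quantile stands in for the value-at-risk threshold, and we assume the convention that tokens scoring at or above the $\alpha$-quantile are retained.

```python
import numpy as np


def var_token_mask(scores: np.ndarray, alpha: float) -> np.ndarray:
    """Return a boolean mask of tokens at or above the empirical alpha-quantile.

    Assumption: the alpha-quantile of the per-token scores plays the role of
    the value-at-risk threshold, and higher scores mean higher risk.
    """
    threshold = np.quantile(scores, alpha)
    return scores >= threshold


def selective_loss(token_losses: np.ndarray, alpha: float) -> float:
    """Average the loss over retained tokens only.

    With the loss itself as the score, this is an empirical estimate of a
    conditional value-at-risk (tail-average) objective.
    """
    mask = var_token_mask(token_losses, alpha)
    return float(token_losses[mask].mean())


# Toy batch of per-token losses (illustrative values, not from the paper).
token_losses = np.array([0.1, 0.2, 0.3, 0.4, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0])
mask = var_token_mask(token_losses, alpha=0.5)
tail_loss = selective_loss(token_losses, alpha=0.5)
print(int(mask.sum()), tail_loss, float(token_losses.mean()))
```

In this toy batch the tail average is larger than the plain mean, which reflects the risk-averse emphasis on high-loss tokens. In an actual training loop, only the retained tokens would contribute to the backward pass.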

Paper Structure

This paper contains 47 sections, 13 equations, 15 figures, 13 tables, 3 algorithms.

Figures (15)

  • Figure 1: Illustration of the ESLM approach. ESLM computes token-level risk scores and retains only high-risk tokens via a value-at-risk threshold. This reshapes the effective training distribution and loss by focusing computational resources on tokens with higher learning value.
  • Figure 2: Training FLOPs ($\downarrow$) required to reach target validation (log) perplexity. We report the training FLOPs required by each method, for model sizes {124M, 350M, 774M}, to reach a target validation loss threshold across datasets. ESLM reduces training cost by focusing optimization on the high-risk tokens, eliminating redundant gradient computation. This efficiency gain holds consistently across model scales. See Appendix \ref{app:val-loss-vs-flops-results} for the convergence of validation loss versus training FLOPs.
  • Figure 3: Validation loss vs. training FLOPs. We report the convergence of validation loss against training FLOPs (both axes in log scale for better visibility) for models trained on the SlimPajama-6B-Unif mixture. ESLM variants with $\alpha=0.1$ consistently reach lower loss with fewer FLOPs, with larger efficiency gains as the model scales. See Appendix \ref{app:val-loss-vs-flops-results} for results on other pretraining corpora.
  • Figure 4: 5-shot normalized accuracy ($\uparrow$) on HellaSwag throughout training. ESLM variants reach higher accuracy than the baselines, with particular gains in the later training stages.
  • Figure 5: Extended analyses demonstrating use cases of ESLM. (a): ESLM enables batch scaling, improving generalization accuracy ($\uparrow$) over baselines under the same compute budget. (b): Ada-ESLM reduces the training FLOPs required to reach the target validation (log) perplexity ($\downarrow$) by adaptively tuning the $\alpha$ level based on training dynamics. (c): In risk-aware knowledge distillation for the 774M model, ESLM reaches the target validation (log) perplexity with substantially less compute (FLOPs) than the baseline models. (d): Varying the $\alpha$ level enables flexible control over the trade-off between training efficiency and model quality.
  • ...and 10 more figures
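
The role of the $\alpha$ level mentioned in the Figure 5(d) caption can be made concrete with a small sketch. It assumes that $\alpha$ is a quantile level and that tokens scoring at or above the empirical $\alpha$-quantile are kept, so roughly a $(1-\alpha)$ fraction of tokens survives selection. The helper `retained_fraction` and the synthetic exponential scores are hypothetical and serve only as illustration.

```python
import numpy as np


def retained_fraction(scores: np.ndarray, alpha: float) -> float:
    """Fraction of tokens whose score is at or above the empirical alpha-quantile."""
    threshold = np.quantile(scores, alpha)
    return float(np.mean(scores >= threshold))


# Synthetic per-token risk scores (assumption: any continuous distribution
# works here; the exponential is just a convenient skewed choice).
rng = np.random.default_rng(0)
scores = rng.exponential(scale=1.0, size=10_000)

fractions = {alpha: retained_fraction(scores, alpha) for alpha in (0.1, 0.2, 0.5)}
for alpha, frac in fractions.items():
    print(f"alpha={alpha:.1f} -> retained fraction ~ {frac:.2f}")
```

Under this assumed convention, a larger $\alpha$ discards more tokens per batch, which lowers the cost of the backward pass at the price of training on less of the data.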