Towards Optimal Learning of Language Models

Yuxian Gu; Li Dong; Yaru Hao; Qingxiu Dong; Minlie Huang; Furu Wei

TL;DR

This work proposes a principled theory for accelerating language-model learning by treating LM training as lossless data compression and minimizing the area under the curve (AUC) of the loss on the desired data distribution. The central result, the Learning Law, shows that in the optimal learning regime all non-noisy training examples contribute equally, implying a dynamic data-reweighting policy that emphasizes highly contributive samples while avoiding overfitting. The authors validate the theory with gradient-flow analysis and empirical experiments on Perceptron and Transformer models (TinyStories), demonstrating speedups of $5.50\times$ and $2.41\times$ respectively, along with improvements in the LM scaling-law coefficients. The work connects data-selection principles, optimization dynamics, and information-theoretic views to offer a coherent path toward efficient LM training, while acknowledging limitations and outlining directions for scaling to larger models and more practical training setups.
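As a rough illustration of the objective described above, the sketch below compares two loss curves by their area under the curve. It assumes the area is approximated by a plain sum of per-step losses; the function name `loss_auc` and both loss curves are made up for illustration and are not results from the paper.

```python
def loss_auc(losses):
    """Approximate the area under a loss curve by summing per-step losses."""
    return float(sum(losses))

# Hypothetical per-step losses, chosen only for illustration.
conventional = [4.0, 3.0, 2.5, 2.2, 2.0]
accelerated = [4.0, 2.6, 2.1, 2.0, 2.0]

auc_conventional = loss_auc(conventional)
auc_accelerated = loss_auc(accelerated)

# Both runs end at the same loss, but the faster-dropping curve has the smaller area.
print(auc_conventional, auc_accelerated)
```

Both hypothetical runs reach the same final loss, so the final value alone cannot separate them; the area does, which is why it serves as a measure of learning speed.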

Abstract

This work studies the general principles of improving the learning of language models (LMs), with the aim of reducing the training steps needed to achieve superior performance. Specifically, we present a theory for the optimal learning of LMs. We first propose an objective that optimizes LM learning by maximizing the data compression ratio in an "LM-training-as-lossless-compression" view. Then, we derive a theorem, named Learning Law, to reveal the properties of the dynamics in the optimal learning process under our objective. The theorem is then validated by experiments on a linear classification task and a real-world language modeling task. Finally, we empirically verify that the optimal learning of LMs essentially stems from the improvement of the coefficients in the scaling law of LMs, indicating great promise and significance for designing practical learning acceleration methods. Our code can be found at https://aka.ms/LearningLaw.

Paper Structure (42 sections, 3 theorems, 22 equations, 14 figures, 1 table, 2 algorithms)

Introduction
Problem Formulation
Theory for Optimal Learning of LMs
Objective: Maximizing Compression Ratio
Learning Law
Discussion
Theorem 3.1 suggests a matching of the local and global learning.
The optimal learning policy establishes a dynamic data re-weighting strategy.
Theorem 3.1 is a necessary condition for the optimal learning dynamics.
Experiments
Finding the Optimal Learning Policy
Experimental Setup
Perceptron Linear Classification.
Transformer Language Modeling.
Learning Policy Optimization Results
...and 27 more sections

Key Result

Theorem 3.1

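When an LM is trained with an optimal learning policy, which yields a learning process corresponding to a maximum compression ratio on the desired data distribution, the following condition holds for $0 < t \le T$ and any $m$, $n$ such that $\gamma_m(t) > 0$, $\gamma_n(t) > 0$:

$$\nabla L \cdot \nabla l_m = \nabla L \cdot \nabla l_n = \mathrm{Const},$$

where $\nabla L=\nabla L^{\mathrm{dsr}}({\bm{\theta}}(t))=\nabla\frac{1}{K}\sum^{K}_{k=1} l(x^{\mathrm{dsr}}_k, {\bm{\theta}}(t))$ is the gradient of the desired loss, $\nabla l_m$ and $\nabla l_n$ are the loss gradients on the $m$-th and $n$-th training examples, and $\mathrm{Const}$ does not depend on $m$ or $n$.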
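The quantity the Learning Law constrains is the contribution of each training example, described in the Figure 4 caption as the dot product between the gradient on that example and the gradient of the desired loss. The following is a minimal sketch of that computation, assuming a toy linear model with squared loss in place of an LM. The helper names `per_example_grads` and `contributions` and all data values are hypothetical and not taken from the authors' code.

```python
import numpy as np

def per_example_grads(theta, X, y):
    """Gradient of the squared loss 0.5 * (x . theta - y)^2 for each row of X."""
    residual = X @ theta - y            # shape (N,)
    return residual[:, None] * X        # shape (N, D)

def contributions(theta, X_trn, y_trn, X_dsr, y_dsr):
    """Dot product of the desired-loss gradient with each training-example gradient."""
    grad_dsr = per_example_grads(theta, X_dsr, y_dsr).mean(axis=0)  # desired-loss gradient, (D,)
    grads_trn = per_example_grads(theta, X_trn, y_trn)              # per-example gradients, (N, D)
    return grads_trn @ grad_dsr                                     # one contribution per example, (N,)

# Toy data (assumed values, for illustration only).
theta = np.array([1.0, -1.0])
X_trn = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_trn = np.array([0.0, 0.0, 0.0])
X_dsr = np.array([[1.0, 0.0], [0.0, 1.0]])
y_dsr = np.array([0.0, 0.0])

contrib = contributions(theta, X_trn, y_trn, X_dsr, y_dsr)
print(contrib)
```

In this toy setup the first two examples have equal contributions, while the third is already fit perfectly and contributes nothing. Under the theorem, only examples with nonzero weight $\gamma_n(t)$ need to share the same contribution.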

Figures (14)

Figure 1: Our objective is to minimize the area under loss curve, which is equivalent to maximizing the compression ratio of training corpus in the "LM-training-as-lossless-compression" view. A learning law is proposed to reveal the training dynamics of the above optimal learning.
Figure 2: Optimal learning attains the theoretical speedup upper bound of Transformer LM training on the TinyStories corpus.
Figure 3: The (near-)optimal LM learning improves the scaling laws over conventional LM training. The coefficients $B,\beta$ are used to fit the loss curves in Figure \ref{fig:exp}, i.e., $\mathrm{Loss} = L_0 + \left( B/t \right)^{\beta}$ when $t>t_0$. See Section \ref{sec:scaling_law} for details.
Figure 4: A: 3-D illustration of Learning Law (Theorem 3.1). In the optimal learning process, all training examples should have the same contribution to LM learning, where the contribution is defined as the dot-product of the gradient on individual samples ($\nabla l_m$, $\nabla l_n$, and $\nabla l_k$) and the gradient of a desired loss ($\nabla L$). See Section \ref{sec:derive} for rigorous notation definitions. B: Experimental evidence of Learning Law. When LM learning approaches the optimum, the similarity of example contributions tends to $+\infty$, which means all examples have the same contribution to the LM.
Figure 5: Learning policy optimization results in Perceptron linear classification (a) and Transformer language modeling tasks (b). We plot the learning policy optimization loss $J(\gamma)$ (solid lines), defined in Equation \ref{eq:method}, which represents the area under the curve (AUC) of the desired Perceptron or Transformer loss. We also show the corresponding compression ratio of the training process (dashed lines) in an "LM-as-Lossless-Compression" view. The optimization starts from conventional learning and smoothly converges to near-optimal learning with low loss AUC and high compression ratio.
...and 9 more figures
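To make the scaling-law form in the Figure 3 caption concrete, the sketch below evaluates $\mathrm{Loss} = L_0 + (B/t)^{\beta}$ and inverts it to get the number of steps needed to reach a target loss. The coefficient values are invented for illustration and are not the paper's fitted values; the function names are likewise hypothetical.

```python
def scaling_law_loss(t, L0, B, beta):
    """Loss predicted at step t by Loss = L0 + (B / t) ** beta."""
    return L0 + (B / t) ** beta

def steps_to_reach(target, L0, B, beta):
    """Invert the scaling law: the step count t at which the loss equals target."""
    if target <= L0:
        raise ValueError("target must be above the irreducible loss L0")
    return B / (target - L0) ** (1.0 / beta)

# Hypothetical coefficients: same L0 and beta, smaller B for the faster policy.
L0, beta = 1.0, 0.5
B_conventional, B_improved = 1000.0, 500.0
target = 2.0

t_conventional = steps_to_reach(target, L0, B_conventional, beta)
t_improved = steps_to_reach(target, L0, B_improved, beta)
speedup = t_conventional / t_improved
print(t_conventional, t_improved, speedup)
```

With $\beta$ held fixed, the steps needed to reach any target loss scale linearly with $B$, so halving $B$ halves the training steps in this toy setting. Changing $\beta$ as well would make the speedup depend on the target loss.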

Theorems & Definitions (11)

Theorem 3.1: Learning Law
proof
proof
proof
Theorem A.1
proof
Remark 1
Remark 2
Remark 3
Lemma B.1
...and 1 more
