L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling

Zhuo Chen, Oriol Mayné i Comas, Zhuotao Jin, Di Luo, Marin Soljačić

TL;DR

The paper tackles long-context language modeling by introducing a bipartite mutual information scaling law that captures multi-token dependencies across long sequences. It defines the L$^2$M condition, linking the required growth of a model's history state to the observed scaling $I^{\text{BP}}_{L/2;L}\sim L^{\beta}$, and proves that history-state dimensionality must grow at least as fast as this scaling to maintain MI-capability. The authors validate the framework with both synthetic sub-volume Gaussian data and natural-language datasets (PG19, Wikipedia), showing that bipartite MI scales sub-linearly with sequence length and that transformers naturally satisfy L$^2$M as single models while fixed-state architectures require model-series growth. They compare bipartite MI with traditional two-point MI, illustrating that relying on $I^{\text{TP}}$ can misrepresent long-range dependencies, and provide practical implications for architectural design and efficiency. Overall, the work offers a principled, information-theoretic foundation to understand and engineer long-context capable models beyond the quadratic costs of standard transformers, with potential applicability to diverse sequential domains.
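To make the scaling $I^{\text{BP}}_{L/2;L}\sim L^{\beta}$ concrete, here is a minimal sketch, not the authors' code. It assumes zero-mean jointly Gaussian data, for which the mutual information between two halves has the standard closed form $\tfrac{1}{2}\left(\log\det\Sigma_{AA} + \log\det\Sigma_{BB} - \log\det\Sigma\right)$, and it estimates $\beta$ by an ordinary log-log least-squares fit. The function names `gaussian_bipartite_mi` and `fit_power_law` are hypothetical, and the fitting procedure is an assumption rather than the paper's method.

```python
import numpy as np


def gaussian_bipartite_mi(cov):
    """Mutual information (in nats) between the first and second half
    of a zero-mean Gaussian vector with covariance matrix `cov`."""
    cov = np.asarray(cov, dtype=float)
    half = cov.shape[0] // 2
    # slogdet returns (sign, log|det|); it is more stable than log(det(...)).
    _, logdet_a = np.linalg.slogdet(cov[:half, :half])
    _, logdet_b = np.linalg.slogdet(cov[half:, half:])
    _, logdet_ab = np.linalg.slogdet(cov)
    return 0.5 * (logdet_a + logdet_b - logdet_ab)


def fit_power_law(lengths, mi_values):
    """Fit I = a * L**beta by least squares in log-log space.
    Returns (a, beta)."""
    log_l = np.log(np.asarray(lengths, dtype=float))
    log_i = np.log(np.asarray(mi_values, dtype=float))
    beta, log_a = np.polyfit(log_l, log_i, 1)  # slope first, then intercept
    return float(np.exp(log_a)), float(beta)
```

In use, one would evaluate `gaussian_bipartite_mi` on covariance matrices of increasing size $L$ and pass the resulting values to `fit_power_law`; a fitted `beta` below 1 corresponds to the sub-linear (sub-volume) regime described above.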

Abstract

We present a universal theoretical framework for understanding long-context language modeling based on a bipartite mutual information scaling law that we rigorously verify in natural language. We demonstrate that bipartite mutual information captures multi-token interactions distinct from and scaling independently of conventional two-point mutual information, and show that this provides a more complete characterization of the dependencies needed for accurately modeling long sequences. Leveraging this scaling law, we formulate the Long-context Language Modeling (L$^2$M) condition, which lower bounds the necessary scaling of a model's history state -- the latent variables responsible for storing past information -- for effective long-context modeling. We validate the framework and its predictions on transformer and state-space models. Our work provides a principled foundation to understand long-context modeling and to design more efficient architectures with stronger long-context capabilities, with potential applications beyond natural language.

Paper Structure

This paper contains 47 sections, 6 theorems, 50 equations, 13 figures, 2 tables.

Key Result

Theorem 5.2

The bipartite mutual information that a model can capture is bounded by the size of its history state: $I^{\text{BP}}_{L/2;L} \le C \cdot \dim({\boldsymbol{z}}) + \log M$, where $C$ is some constant and $M$ denotes the vocabulary size.
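The following sketch shows how a bound of this kind turns into a scaling requirement on the history state. It rests on two labeled assumptions: that the bound is linear in $\dim({\boldsymbol{z}})$ with an additive $\log M$ term, and that the data's bipartite mutual information grows as $a L^{\beta}$ with a hypothetical prefactor $a$. The superscript $q$ is notation introduced here for the mutual information captured by the model.

```latex
\begin{align}
  % Requirement: the model must capture the MI present in the data.
  a L^{\beta} \;\approx\; I^{\text{BP}}_{L/2;L}
    \;=\; I^{\text{BP},q}_{L/2;L}
    &\le C \cdot \dim(\boldsymbol{z}) + \log M \\
  % Rearranging; C and \log M do not depend on L.
  \Longrightarrow\quad
  \dim(\boldsymbol{z}) &\ge \frac{a L^{\beta} - \log M}{C}
    \;=\; \Omega\!\left(L^{\beta}\right)
\end{align}
```

Under these assumptions, a history state of fixed size cannot keep up with the data's mutual information as $L$ grows, whereas any state that grows at least as fast as $L^{\beta}$ can.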

Figures (13)

  • Figure 1: (a) The bipartite mutual information between two text segments scales as a power law (sub-volume law) with sequence length $L$. (b) In autoregressive models, conditional distributions are parameterized through the history state ${\boldsymbol{z}}$, the latent variables that store past information. Examples of the history state include the recurrent states in state-space models or recurrent neural networks, and the key-value pairs in transformers. (c) The maximum bipartite mutual information a model can express scales with the dimensionality of its history state, $\dim({\boldsymbol{z}})$. To model long contexts effectively, $\dim({\boldsymbol{z}})$ must grow at least as fast as the power-law scaling of the true bipartite mutual information.
  • Figure 2: (a) Illustration of bipartite and two-point mutual information. The bipartite mutual information measures statistical dependence between two adjacent segments within a text block of length $L$, whereas the two-point mutual information measures the dependence between two tokens separated by a distance $d$. (b) Estimates of bipartite mutual information using the LLaMA 3.1 405B model on the PG19 dataset of pre-1919 books. (c) Estimates of two-point mutual information on the PG19 dataset. See the appendices for additional results.
  • Figure 3: Evaluation of KL-divergence across model architectures trained on synthetic data that satisfies the bipartite mutual information scaling. (a, b) Average KL-divergence per token for models trained on different sequence lengths. (c) Average KL-divergence per token as a function of the ratio between bipartite mutual information and Mamba recurrent state sizes.
  • Figure 4: Position-wise conditional negative log likelihood (NLL) evaluation for models trained on 8192-token sequences on the PG19 dataset.
  • Figure B.1: Bipartite mutual information estimation using (left) LLaMA 3.1 405B on the Wikipedia dataset and (right) the DeepSeek V3 Base model on the PG19 dataset. All direct measurements include the bias correction described in the appendix.
  • ...and 8 more figures
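The contrast drawn in Figure 1 between growing and fixed history states can be illustrated with a rough heuristic check. This is a sketch under stated assumptions, not the paper's evaluation. It assumes a transformer whose history state is its key-value cache (two vectors of width `d_model` per layer per token) and a fixed-state model whose state size does not depend on sequence length. All default sizes, the function names, and the value of `beta` passed by the caller are hypothetical.

```python
import math


def kv_cache_dim(seq_len, n_layers=32, d_model=4096):
    """History-state size of a transformer: one key and one value vector
    per layer per token (assumed layout, no cache compression)."""
    return 2 * n_layers * d_model * seq_len


def fixed_state_dim(seq_len, n_layers=32, d_state=16, d_inner=8192):
    """History-state size of a fixed-state recurrent model:
    independent of seq_len by construction."""
    return n_layers * d_state * d_inner


def growth_exponent(dim_fn, lengths):
    """Slope of log dim(z) against log L, via least squares."""
    xs = [math.log(length) for length in lengths]
    ys = [math.log(dim_fn(length)) for length in lengths]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def meets_l2m_scaling(dim_fn, lengths, beta, tol=1e-6):
    """Heuristic: does dim(z) grow at least as fast as L**beta
    over the given range of lengths?"""
    return growth_exponent(dim_fn, lengths) >= beta - tol
```

For any sub-linear `beta`, the key-value cache has growth exponent 1 and passes the check as a single model, while the fixed-state model has exponent 0 and fails, which is why it would need a series of progressively larger models.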

Theorems & Definitions (16)

  • Definition 4.1: Bipartite Mutual Information [Fig. 2(a)]
  • Definition 4.2: Two-point Mutual Information [Fig. 2(a)]
  • Definition 5.1
  • Theorem 5.2
  • proof
  • Definition 5.3
  • Theorem 5.4: L$^2$M Condition for Single Models
  • proof
  • Definition 5.5
  • Theorem 5.6: L$^2$M Condition for Model Series
  • ...and 6 more
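For contrast with the bipartite quantity in Definition 4.1, the two-point mutual information of Definition 4.2 can be sketched with a simple plug-in estimator over token pairs separated by a distance $d$. This is a generic count-based estimator, assumed here for illustration only; it is not necessarily the estimator used in the paper, and the function name `two_point_mi` is hypothetical.

```python
from collections import Counter
import math


def two_point_mi(tokens, d):
    """Plug-in estimate (in nats) of the mutual information between
    tokens at positions i and i + d, from empirical pair counts."""
    if d < 1:
        raise ValueError("d must be a positive integer")
    pairs = list(zip(tokens[:-d], tokens[d:]))
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)    # marginal of the earlier token
    right = Counter(y for _, y in pairs)   # marginal of the later token
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log(count * n / (left[x] * right[y]))
    return mi
```

Plug-in estimates of this kind are biased upward when the number of observed pairs is small relative to the vocabulary, so values at large $d$ on real text should be read with care.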