L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling
Zhuo Chen, Oriol Mayné i Comas, Zhuotao Jin, Di Luo, Marin Soljačić
TL;DR
The paper tackles long-context language modeling by introducing a bipartite mutual information scaling law that captures multi-token dependencies across long sequences. It defines the L$^2$M condition, which links the required growth of a model's history state to the observed scaling $I^{\text{BP}}_{L/2;L}\sim L^{\beta}$, and proves that the history-state dimensionality must grow at least as fast as this scaling to maintain MI capability. The authors validate the framework on both synthetic sub-volume Gaussian data and natural-language datasets (PG19, Wikipedia), showing that bipartite MI scales sub-linearly with sequence length and that transformers naturally satisfy the L$^2$M condition as single models, whereas fixed-state architectures require a series of increasingly large models. They compare bipartite MI with the traditional two-point MI, illustrating that relying on $I^{\text{TP}}$ can misrepresent long-range dependencies, and discuss practical implications for architectural design and efficiency. Overall, the work offers a principled, information-theoretic foundation for understanding and engineering long-context models beyond the quadratic cost of standard transformers, with potential applicability to other sequential domains.
Abstract
We present a universal theoretical framework for understanding long-context language modeling based on a bipartite mutual information scaling law that we rigorously verify in natural language. We demonstrate that bipartite mutual information captures multi-token interactions distinct from and scaling independently of conventional two-point mutual information, and show that this provides a more complete characterization of the dependencies needed for accurately modeling long sequences. Leveraging this scaling law, we formulate the Long-context Language Modeling (L$^2$M) condition, which lower bounds the necessary scaling of a model's history state -- the latent variables responsible for storing past information -- for effective long-context modeling. We validate the framework and its predictions on transformer and state-space models. Our work provides a principled foundation to understand long-context modeling and to design more efficient architectures with stronger long-context capabilities, with potential applications beyond natural language.
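To make the distinction between bipartite and two-point mutual information concrete, the following is a minimal sketch on a toy zero-mean Gaussian sequence, where both quantities have a closed form in terms of log-determinants of covariance blocks. The power-law covariance `power_law_cov`, its exponent `alpha`, and all function names are illustrative assumptions; this is not the paper's sub-volume Gaussian construction, and the fitted exponent it prints is not a result from the paper.

```python
import numpy as np


def power_law_cov(L, alpha=0.5):
    """Toy covariance C[i, j] = (1 + |i - j|)^(-alpha).

    Hypothetical choice for illustration only, not the paper's distribution.
    """
    idx = np.arange(L)
    return (1.0 + np.abs(idx[:, None] - idx[None, :])) ** (-alpha)


def gaussian_mi(cov, a, b):
    """Mutual information (in nats) between index sets a and b of a
    zero-mean Gaussian with covariance `cov`:
    I(A; B) = 0.5 * (logdet C_AA + logdet C_BB - logdet C_{AB,AB}).
    """
    a, b = np.asarray(a), np.asarray(b)

    def logdet(ix):
        # slogdet returns (sign, log|det|); the blocks are positive definite.
        return np.linalg.slogdet(cov[np.ix_(ix, ix)])[1]

    return 0.5 * (logdet(a) + logdet(b) - logdet(np.concatenate([a, b])))


def bipartite_mi(cov):
    """MI between the first half and the second half of the sequence."""
    L = cov.shape[0]
    half = L // 2
    return gaussian_mi(cov, np.arange(half), np.arange(half, L))


def two_point_mi(cov, d):
    """MI between the token at position 0 and the token at position d."""
    return gaussian_mi(cov, [0], [d])


if __name__ == "__main__":
    lengths = [16, 32, 64, 128, 256]
    bp = [bipartite_mi(power_law_cov(L)) for L in lengths]
    tp = [two_point_mi(power_law_cov(L), L // 2) for L in lengths]

    # Log-log fit of bipartite MI against L as a rough estimate of beta.
    beta = np.polyfit(np.log(lengths), np.log(bp), 1)[0]

    for L, b, t in zip(lengths, bp, tp):
        print(f"L={L:4d}  bipartite MI={b:.4f}  two-point MI (d=L/2)={t:.6f}")
    print(f"fitted exponent for bipartite MI on this toy model: {beta:.3f}")
```

In this toy setting the two-point MI at separation $L/2$ shrinks as $L$ grows, while the bipartite MI between the two halves cannot decrease, because longer halves contain the shorter ones. This is the qualitative gap between the two measures that the abstract describes.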
