Arrows of Time for Large Language Models
Vassilis Papadopoulos, Jérémie Wenger, Clément Hongler
TL;DR
The paper investigates the Arrow of Time (AoT) in autoregressive Large Language Models by contrasting forward (FW) next-token prediction with backward (BW) previous-token prediction. It formalizes the two time-directional measures $\mathbb{P}_n^{\rightarrow}$ and $\mathbb{P}_n^{\leftarrow}$ together with their cross-entropy losses. Large-scale natural-language experiments across languages, architectures, and context lengths reveal a consistent FW AoT: forward losses are lower than backward losses, with a gap whose magnitude scales with context length and model size. The paper complements these results with synthetic computability-theoretic constructions (e.g., $p\times q\leftrightarrow\mathrm{rev}(pq)$ and linear sparse circuits) that illustrate how mappings that preserve information but are hard to invert can produce an AoT through sparsity and computational complexity. The work argues that the AoT reflects fundamental long-range structure in natural-language data, and it provides a framework linking causality, information theory, and complexity to differences in learnability under time reversal. Key implications include the universality of the FW AoT across languages and models, the role of long-range dependencies in its magnitude, and possible connections to broader notions of complexity and irreversibility in data-driven learning, with potential consequences for understanding language structure and guiding model design.
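To make the $p\times q\leftrightarrow\mathrm{rev}(pq)$ construction concrete, here is a minimal sketch of how one such synthetic string might be built. The separators `x` and `=`, the helper names, and the restriction to primes are assumptions made for illustration; they are not taken from the summary above and may differ from the authors' exact setup.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality check, sufficient for small illustrative inputs."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def make_sample(p: int, q: int) -> str:
    """Build the string 'p x q = rev(pq)' (hypothetical formatting).

    Assumption: p and q are primes, written in sorted order, so the product
    determines the pair and the string carries the same information when
    read in either direction.
    """
    if not (is_prime(p) and is_prime(q)):
        raise ValueError("this sketch assumes prime factors")
    p, q = sorted((p, q))
    product_reversed = str(p * q)[::-1]  # rev(pq): digits of the product, reversed
    return f"{p}x{q}={product_reversed}"


sample = make_sample(7, 11)
forward_view = sample         # factors first: continuing the text means multiplying
backward_view = sample[::-1]  # product first: continuing the text means factoring
```

Read left to right, a model sees the factors and only has to multiply. Read right to left, it sees the product first and has to recover the factors, which is believed to be much harder at scale even though both readings contain the same information.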
Abstract
We study the probabilistic modeling performed by Autoregressive Large Language Models (LLMs) through the angle of time directionality, addressing a question first raised in (Shannon, 1951). For large enough models, we empirically find a time asymmetry in their ability to learn natural language: a difference in the average log-perplexity when trying to predict the next token versus when trying to predict the previous one. This difference is at the same time subtle and very consistent across various modalities (language, model size, training time, ...). Theoretically, this is surprising: from an information-theoretic point of view, there should be no such difference. We provide a theoretical framework to explain how such an asymmetry can appear from sparsity and computational complexity considerations, and outline a number of perspectives opened by our results.
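The information-theoretic symmetry mentioned in the abstract follows from the chain rule: the same joint distribution factorizes as a product of next-token conditionals or as a product of previous-token conditionals, so the exact forward and backward cross-entropies both equal the joint entropy. The sketch below checks this numerically on a toy distribution; the function names, the binary vocabulary, and the sequence length are illustrative assumptions, not the authors' experimental setup.

```python
import itertools
import math
import random


def toy_joint(n=3, vocab=(0, 1), seed=0):
    """Random joint distribution over all length-n sequences (toy assumption)."""
    rng = random.Random(seed)
    seqs = list(itertools.product(vocab, repeat=n))
    weights = [rng.random() + 1e-3 for _ in seqs]
    total = sum(weights)
    return {s: w / total for s, w in zip(seqs, weights)}


def prefix_prob(joint, prefix):
    """Marginal probability that a sequence starts with `prefix`."""
    k = len(prefix)
    return sum(p for s, p in joint.items() if s[:k] == prefix)


def suffix_prob(joint, suffix):
    """Marginal probability that a sequence ends with `suffix`."""
    k = len(suffix)
    return sum(p for s, p in joint.items() if s[len(s) - k:] == suffix)


def forward_loss(joint):
    """Expected sum over positions of -log P(x_i | x_1 .. x_{i-1})."""
    loss = 0.0
    for s, p in joint.items():
        for i in range(len(s)):
            cond = prefix_prob(joint, s[: i + 1]) / prefix_prob(joint, s[:i])
            loss -= p * math.log(cond)
    return loss


def backward_loss(joint):
    """Expected sum over positions of -log P(x_i | x_{i+1} .. x_n)."""
    loss = 0.0
    for s, p in joint.items():
        for i in range(len(s) - 1, -1, -1):
            cond = suffix_prob(joint, s[i:]) / suffix_prob(joint, s[i + 1:])
            loss -= p * math.log(cond)
    return loss


joint = toy_joint()
entropy = -sum(p * math.log(p) for p in joint.values())
fw, bw = forward_loss(joint), backward_loss(joint)
```

With exact conditionals, `fw`, `bw`, and `entropy` agree up to floating-point error. Any gap observed between trained forward and backward models therefore has to come from what the models manage to learn, not from the data distribution itself.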
