Arrows of Time for Large Language Models

Vassilis Papadopoulos; Jérémie Wenger; Clément Hongler

TL;DR

The paper investigates the Arrow of Time (AoT) in autoregressive Large Language Models by contrasting forward (FW) next-token prediction with backward (BW) previous-token prediction, formalizing the time-directional measures $\mathbb{P}_n^{\rightarrow}$ and $\mathbb{P}_n^{\leftarrow}$ and their cross-entropy losses. Through large-scale natural-language experiments across languages, architectures, and context lengths, it reveals a consistent FW AoT: forward losses are lower than backward ones, with a gap whose magnitude scales with context length and model size. It complements these results with synthetic computability-theoretic constructions (e.g., $p\times q\leftrightarrow\mathrm{rev}(pq)$ and linear sparse circuits) that illustrate how information-preserving but hard-to-invert mappings can produce an AoT through sparsity and computational complexity. The work argues that the AoT reflects fundamental long-range structure in natural language data, and it provides a framework linking causality, information theory, and complexity to differences in learnability under time reversal, with potential implications for understanding language structure and guiding model design. Key implications include the universality of the FW AoT across languages and models, the role of long-range dependencies in its magnitude, and the possibility of connecting the AoT to broader notions of complexity and irreversibility in data-driven learning.
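To make the $p\times q\leftrightarrow\mathrm{rev}(pq)$ construction more concrete, here is a minimal Python sketch of one possible reading of it. The string encoding (`x` and `=` as separators, the product written with its digits reversed) and the helper names `make_sample` and `reverse_sample` are assumptions made for illustration; the paper's exact format may differ. The point is only that the same string is easy to continue in one direction (multiply) and hard in the other (factor).

```python
def make_sample(p: int, q: int) -> str:
    """Encode two factors followed by their product with its digits reversed.

    Hypothetical encoding, chosen for illustration only.
    """
    product_reversed = str(p * q)[::-1]
    return f"{p}x{q}={product_reversed}"


def reverse_sample(sample: str) -> str:
    """Character-level time reversal of a sample string."""
    return sample[::-1]


if __name__ == "__main__":
    forward = make_sample(11, 13)       # factors first, then the product
    backward = reverse_sample(forward)  # product first, then the factors
    print(forward)
    print(backward)
```

Both strings carry exactly the same information. A model reading the first one only has to multiply to predict what follows the `=` sign, whereas a model reading the second sees the product first and would have to factor it to predict the rest.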

Abstract

We study the probabilistic modeling performed by Autoregressive Large Language Models (LLMs) through the angle of time directionality, addressing a question first raised in (Shannon, 1951). For large enough models, we empirically find a time asymmetry in their ability to learn natural language: a difference in the average log-perplexity when trying to predict the next token versus when trying to predict the previous one. This difference is at the same time subtle and very consistent across various modalities (language, model size, training time, ...). Theoretically, this is surprising: from an information-theoretic point of view, there should be no such difference. We provide a theoretical framework to explain how such an asymmetry can appear from sparsity and computational complexity considerations, and outline a number of perspectives opened by our results.
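The information-theoretic claim can be checked on a toy example. The sketch below is an illustration and not the authors' code: it builds an arbitrary joint distribution over short sequences and computes the expected log-loss of the exact forward and backward chain-rule factorizations. The function names, vocabulary size, and sequence length are all assumptions chosen for the example. With exact conditionals, both losses equal the entropy of the joint distribution, so any measured gap must come from what a model can learn in practice.

```python
import itertools
import math
import random


def random_joint(vocab_size, length, seed=0):
    """Arbitrary strictly positive joint distribution over all sequences."""
    rng = random.Random(seed)
    seqs = list(itertools.product(range(vocab_size), repeat=length))
    weights = [rng.random() + 1e-3 for _ in seqs]
    total = sum(weights)
    return {s: w / total for s, w in zip(seqs, weights)}


def marginal(joint, positions):
    """Marginal distribution of the tokens at the given positions."""
    out = {}
    for seq, p in joint.items():
        key = tuple(seq[i] for i in positions)
        out[key] = out.get(key, 0.0) + p
    return out


def directional_loss(joint, length, forward=True):
    """Expected total log-loss of the exact chain-rule factorization.

    forward=True predicts tokens left to right, forward=False right to left.
    """
    order = list(range(length)) if forward else list(range(length - 1, -1, -1))
    loss = 0.0
    for step in range(length):
        seen = sorted(order[:step])          # positions already revealed
        both = sorted(order[:step + 1])      # revealed positions plus the target
        p_seen = marginal(joint, seen)
        p_both = marginal(joint, both)
        for seq, p in joint.items():
            k_seen = tuple(seq[i] for i in seen)
            k_both = tuple(seq[i] for i in both)
            cond = p_both[k_both] / p_seen[k_seen]  # P(target | revealed)
            loss -= p * math.log(cond)
    return loss


if __name__ == "__main__":
    joint = random_joint(vocab_size=3, length=4, seed=1)
    fw = directional_loss(joint, 4, forward=True)
    bw = directional_loss(joint, 4, forward=False)
    print(f"forward loss:  {fw:.6f}")
    print(f"backward loss: {bw:.6f}")
```

The two printed values agree up to floating-point error because each sum telescopes to the negative expected log-probability of the whole sequence.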

Paper Structure (45 sections, 4 equations, 12 figures, 9 tables)

Introduction
Autoregressive LLMs
Arrow of Time and Language Models
Cross-Entropy Loss and Perplexity
Setup and Plan
Relation to Previous Works in Language Modeling
Causality and Information Theory
Empirical Results on Natural Language
Setup
Dataset and Tokenization
Models, Hyperparameters and Training
Results
Arrow of Time in English and French
Context Window Size
Model Size
...and 30 more sections

Figures (12)

Figure 1: English vs French validation losses (French training losses in the zoom-in, early loss values cropped for readability).
Figure 2: Percentage difference between BW and FW losses for different context lengths.
Figure 3: Validation loss curves for FW and BW models during training. Consistently, the BW loss is higher than its FW counterpart. This persists through the warm restart of the learning rate, which causes a bump in the loss.
Figure 4: Model loss at the end of training vs. $f^{\rightarrow}$ sparsity.
Figure 5: Validation loss for two epochs of training on the Greek dataset, for forward and backward models.
...and 7 more figures

Theorems & Definitions (7)

Remark 2
Remark 3
Remark 4
Remark 5
Remark 6
Remark 7
Claim 8
