
Arrows of Time for Large Language Models

Vassilis Papadopoulos, Jérémie Wenger, Clément Hongler

TL;DR

The paper investigates an Arrow of Time (AoT) in autoregressive Large Language Models by contrasting forward (FW) next-token prediction with backward (BW) previous-token prediction, formalizing the time-directional measures $\mathbb{P}_n^{\rightarrow}$ and $\mathbb{P}_n^{\leftarrow}$ and their cross-entropy losses. Through large-scale natural-language experiments across languages, architectures, and context lengths, it reveals a consistent FW AoT: forward losses are lower than backward ones, and the size of the gap grows with context length and model size. It complements these results with synthetic computability-theoretic constructions (e.g., $p\times q\leftrightarrow\mathrm{rev}(pq)$ and linear sparse circuits) that illustrate how information-preserving but hard-to-invert mappings can produce an AoT through sparsity and computational complexity. The work argues that the AoT reflects fundamental long-range structure in natural language data, and it provides a framework linking causality, information theory, and complexity to differences in learnability under time reversal, with potential implications for understanding language structure and guiding model design. Key implications include the universality of the FW AoT across languages and models, the role of long-range dependencies in its magnitude, and the possibility of connecting the AoT to broader notions of complexity and irreversibility in data-driven learning.
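The $p\times q\leftrightarrow\mathrm{rev}(pq)$ construction mentioned above can be illustrated with a small sketch. The separators (`x`, `=`), the decimal encoding, and the function names `make_sample` and `reverse_sample` are assumptions made for illustration; the summary does not specify the exact string format or any constraints on $p$ and $q$.

```python
def make_sample(p: int, q: int) -> str:
    """Build a forward-direction string: p, q, then the reversed digits of p*q.

    The "x" and "=" separators are hypothetical; only the idea of pairing
    (p, q) with rev(pq) comes from the summary above.
    """
    product = p * q
    return f"{p}x{q}={str(product)[::-1]}"


def reverse_sample(sample: str) -> str:
    """Character-level time reversal of a sample, as a backward model would read it."""
    return sample[::-1]


forward = make_sample(13, 17)       # "13x17=122", since 13 * 17 = 221
backward = reverse_sample(forward)  # "221=71x31"
```

Read left to right, completing `forward` after the `=` sign only requires a multiplication. Read left to right, completing `backward` after its `=` sign requires recovering the (reversed) factors from the product. Both strings carry the same information, which is the sense in which the mapping is information-preserving but hard to invert.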

Abstract

We study the probabilistic modeling performed by Autoregressive Large Language Models (LLMs) through the angle of time directionality, addressing a question first raised in (Shannon, 1951). For large enough models, we empirically find a time asymmetry in their ability to learn natural language: a difference in the average log-perplexity when trying to predict the next token versus when trying to predict the previous one. This difference is at the same time subtle and very consistent across various modalities (language, model size, training time, ...). Theoretically, this is surprising: from an information-theoretic point of view, there should be no such difference. We provide a theoretical framework to explain how such an asymmetry can appear from sparsity and computational complexity considerations, and outline a number of perspectives opened by our results.
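The information-theoretic claim can be checked on a toy example: by the chain rule, the forward factorization $\prod_t P(x_t \mid x_{<t})$ and the backward factorization $\prod_t P(x_t \mid x_{>t})$ describe the same joint distribution, so the summed expected log-losses of the exact conditionals coincide with the joint entropy in both directions. The sketch below verifies this numerically; the random joint distribution, alphabet, and all function names are illustrative assumptions, not the paper's setup.

```python
import itertools
import math
import random


def random_joint(alphabet, length, seed=0):
    """A random joint distribution over all sequences of a given length."""
    rng = random.Random(seed)
    seqs = list(itertools.product(alphabet, repeat=length))
    weights = [rng.random() + 1e-3 for _ in seqs]
    total = sum(weights)
    return {s: w / total for s, w in zip(seqs, weights)}


def marginal(joint, keep):
    """Marginal distribution over the positions selected by the slice `keep`."""
    out = {}
    for seq, p in joint.items():
        out[seq[keep]] = out.get(seq[keep], 0.0) + p
    return out


def forward_loss(joint, length):
    """Sum over t of E[-log P(x_t | x_<t)] using exact conditionals."""
    total = 0.0
    for t in range(length):
        prefix = marginal(joint, slice(0, t))
        upto = marginal(joint, slice(0, t + 1))
        for seq, p in joint.items():
            total -= p * math.log(upto[seq[: t + 1]] / prefix[seq[:t]])
    return total


def backward_loss(joint, length):
    """Sum over t of E[-log P(x_t | x_>t)] using exact conditionals."""
    total = 0.0
    for t in range(length):
        suffix = marginal(joint, slice(t + 1, length))
        from_t = marginal(joint, slice(t, length))
        for seq, p in joint.items():
            total -= p * math.log(from_t[seq[t:]] / suffix[seq[t + 1:]])
    return total


joint = random_joint("abc", length=4)
entropy = -sum(p * math.log(p) for p in joint.values())
fw = forward_loss(joint, 4)
bw = backward_loss(joint, 4)
```

Here `fw`, `bw`, and `entropy` agree up to floating-point error. Any gap between forward and backward losses observed in trained models therefore has to come from how well each factorization can be learned, not from the data distribution itself.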

Paper Structure (45 sections, 4 equations, 12 figures, 9 tables)

This paper contains 45 sections, 4 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: English vs French validation losses (French training losses in the zoom-in, early loss values cropped for readability).
  • Figure 2: Percentage difference between BW and FW losses for different context lengths.
  • Figure 3: Validation loss curves for FW and BW models during training. Consistently, the BW loss is higher than its FW counterpart. This persists through the warm restart of the learning rate, which causes a bump in the loss.
  • Figure 4: Model losses at the end of training vs. the sparsity of $f^{\rightarrow}$.
  • Figure 5: Validation loss for two epochs of training on the Greek dataset, for forward and backward models.
  • ...and 7 more figures

Theorems & Definitions (7)

  • Remark 2
  • Remark 3
  • Remark 4
  • Remark 5
  • Remark 6
  • Remark 7
  • Claim 8