Reverse Language Model

Xunjian Yin, Sitao Cheng, Yuxi Xie, Xinyu Hu, Li Lin, Xinyi Wang, Liangming Pan, William Yang Wang, Xiaojun Wan

TL;DR

This work investigates reverse-time autoregression by introducing LEDOM (also written Ledom), the first purely reverse-trained autoregressive language model, trained on 435B tokens at 2B and 7B parameter scales. It shows that reverse conditioning yields distinct reasoning pathways and broader output exploration, positioning Ledom as a potential foundational model with its own strengths and safety challenges. To harness these properties, the authors propose Reverse Reward, a posterior evaluation mechanism that uses Ledom to rerank forward-model outputs and improve multi-step reasoning, particularly in mathematics. Empirical results show that Reverse Reward consistently improves performance on mathematical reasoning benchmarks across multiple base models and decoding strategies, which highlights the value of combining forward and reverse generative signals. The work also discusses limitations, such as weaker performance on forward-oriented tasks like code generation, and commits to releasing models and data to encourage further exploration of reverse modeling in NLP.

Abstract

We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction. For the first time, we present the reverse language model as a potential foundational model across general tasks, accompanied by a set of intriguing examples and insights. Based on LEDOM, we further introduce a novel application: Reverse Reward, where LEDOM-guided reranking of forward language model outputs leads to substantial performance improvements on mathematical reasoning tasks. This approach leverages LEDOM's unique backward reasoning capability to refine generation quality through posterior evaluation. Our findings suggest that LEDOM exhibits unique characteristics with broad application potential. We will release all models, training code, and pre-training data to facilitate future research.
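
To make "previous token prediction" concrete, the following is a minimal sketch of how training pairs could be built from one tokenized sequence. It assumes the simplest reading of the abstract: reverse the token order, then apply ordinary autoregressive prediction to the reversed sequence. The function name `reverse_training_pairs` and the word-level tokens are illustrative assumptions, not taken from the LEDOM code.

```python
from typing import List, Tuple


def reverse_training_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Build (context, target) pairs for previous-token prediction.

    Hypothetical helper: the sequence is reversed, and each target is the
    token that immediately precedes its context in the original order.
    """
    reversed_tokens = tokens[::-1]
    pairs = []
    for i in range(1, len(reversed_tokens)):
        context = reversed_tokens[:i]  # tokens that come later in the original text
        target = reversed_tokens[i]    # the token just before them in the original text
        pairs.append((context, target))
    return pairs


# Toy word-level tokens; a real model would use subword token ids.
tokens = ["the", "answer", "is", "42"]
pairs = reverse_training_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

Under this setup the model first sees the end of the text and learns to predict what came before it, which is the mirror image of standard next-token prediction.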

Paper Structure

This paper contains 53 sections, 10 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: The Reverse Language Model (RLM) is pretrained with previous token prediction, in contrast to the standard left-to-right prediction direction of Forward Language Models (FLMs).
  • Figure 2: Illustration of Reverse Reward to guide a multi-step reasoning process. The example illustrates how forward-generated thoughts and answers are scored and refined using Ledom's posterior evaluations, demonstrated for a query requiring two sequential thought stages. Darker shading of the blocks in the diagram corresponds to a higher Reverse Reward, indicating the preferred paths or components in the reasoning chain.
  • Figure 3: The performance of FLM with Reverse Reward over various sampling sizes.
  • Figure 4: Training loss curves comparing Ledom and FLM. Ledom exhibits slower convergence and a higher final loss, indicating greater uncertainty during reverse-temporal modeling.
  • Figure 5: An example of reverse language model evaluation on GSM8K, showing the input and output (manually reversed for human readability) and the gold answer. Demonstrations for few-shot prompting are in magenta.
  • ...and 1 more figure
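
The reranking idea behind Reverse Reward (Figure 2) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's formulation: `rerank_by_reverse_score` and `toy_reverse_score` are hypothetical names, and the toy scorer is a word-overlap stand-in for what would in practice be a reverse model's likelihood of the query given a candidate response. The step-level scoring shown in Figure 2 and any combination with forward-model scores are not reproduced here.

```python
from typing import Callable, List


def rerank_by_reverse_score(
    query: str,
    candidates: List[str],
    reverse_score: Callable[[str, str], float],
) -> List[str]:
    """Sort forward-model candidates so the highest reverse score comes first."""
    return sorted(candidates, key=lambda c: reverse_score(query, c), reverse=True)


def toy_reverse_score(query: str, response: str) -> float:
    """Stand-in scorer (NOT a language model).

    Returns the fraction of query words that also appear in the response,
    as a crude proxy for how well the response "explains" the query.
    """
    query_words = set(query.lower().split())
    response_words = set(response.lower().split())
    return len(query_words & response_words) / max(len(query_words), 1)


query = "how many apples remain"
candidates = [
    "the total is 7",
    "3 apples remain after 4 are eaten",
]
ranked = rerank_by_reverse_score(query, candidates, toy_reverse_score)
print(ranked[0])
```

Passing the scorer in as a callable keeps the reranking logic independent of the model, so the toy function could be swapped for a real posterior score without changing the rest of the sketch.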