Reverse Language Model
Xunjian Yin, Sitao Cheng, Yuxi Xie, Xinyu Hu, Li Lin, Xinyi Wang, Liangming Pan, William Yang Wang, Xiaojun Wan
TL;DR
This work investigates reverse-time autoregression by introducing LEDOM, the first purely reverse-trained autoregressive language model, with 2B and 7B parameter variants trained on 435B tokens. It shows that reverse conditioning yields distinct reasoning pathways and broader output exploration, positioning LEDOM as a potential foundational model with its own strengths and safety challenges. To harness these properties, the authors propose Reverse Reward, a posterior evaluation mechanism that uses LEDOM to rerank forward-model outputs and improve multi-step reasoning, particularly in mathematics. Empirical results show that Reverse Reward consistently improves performance on mathematical reasoning benchmarks across multiple base models and decoding strategies, highlighting the value of combining forward and reverse generative signals. The work also discusses limitations, such as weaker performance on forward-oriented tasks like code generation, and commits to releasing models and data to encourage further exploration of reverse modeling in NLP.
Abstract
We introduce LEDOM, the first purely reverse language model, which processes sequences in reverse temporal order through previous-token prediction. LEDOM is trained autoregressively on 435B tokens, with 2B and 7B parameter variants. For the first time, we present the reverse language model as a potential foundational model across general tasks, accompanied by a set of intriguing examples and insights. Based on LEDOM, we further introduce a novel application: Reverse Reward, where LEDOM-guided reranking of forward language model outputs leads to substantial performance improvements on mathematical reasoning tasks. This approach leverages LEDOM's unique backward reasoning capability to refine generation quality through posterior evaluation. Our findings suggest that LEDOM exhibits unique characteristics with broad application potential. We will release all models, training code, and pre-training data to facilitate future research.
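
The two ideas in the abstract, previous-token prediction and reranking forward outputs with a reverse model, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the toy reverse scorer, and in particular the linear interpolation of forward and reverse scores are hypothetical and are not taken from the paper, which does not specify its scoring formula in this section.

```python
from typing import Callable, List, Sequence, Tuple


def reverse_training_pairs(tokens: Sequence[str]) -> List[Tuple[Tuple[str, ...], str]]:
    """Build (context, target) pairs for previous-token prediction.

    The sequence is reversed, then treated like ordinary next-token
    prediction: each context holds the tokens that come LATER in the
    original text, and the target is the token that precedes them.
    """
    reversed_tokens = list(reversed(tokens))
    return [
        (tuple(reversed_tokens[:i]), reversed_tokens[i])
        for i in range(1, len(reversed_tokens))
    ]


def rerank(
    candidates: Sequence[str],
    forward_logprobs: Sequence[float],
    reverse_score: Callable[[str], float],
    weight: float = 0.5,
) -> str:
    """Pick the candidate with the best combined forward/reverse score.

    ASSUMPTION: scores are combined by linear interpolation with a
    hypothetical `weight`; the actual combination rule may differ.
    `reverse_score` stands in for a reverse-model log-probability.
    """
    combined = [
        (1.0 - weight) * fwd + weight * reverse_score(cand)
        for cand, fwd in zip(candidates, forward_logprobs)
    ]
    best_index = max(range(len(candidates)), key=combined.__getitem__)
    return candidates[best_index]


if __name__ == "__main__":
    # Each context contains only tokens to the right of its target.
    for context, target in reverse_training_pairs(["2", "+", "2", "=", "4"]):
        print(context, "->", target)

    # Toy stand-in for a reverse model: a fixed lookup table of scores.
    toy_reverse_scores = {"answer A": -1.0, "answer B": -0.2}
    chosen = rerank(
        ["answer A", "answer B"],
        forward_logprobs=[-0.1, -0.5],
        reverse_score=toy_reverse_scores.__getitem__,
    )
    print("chosen:", chosen)
```

In this toy setup the forward model alone would prefer "answer A", while the combined score prefers "answer B", which is the kind of posterior correction a reverse signal is meant to supply. In practice `reverse_score` would be replaced by a log-probability computed with a trained reverse model.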
