Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning

Qifan Yu, Zhenyu He, Sijie Li, Xun Zhou, Jun Zhang, Jingjing Xu, Di He

TL;DR

This work targets long-horizon reasoning with Chain-of-Thought (CoT) prompts and the length generalization gaps of auto-regressive models. It introduces RELAY, a two-stage framework that aligns looped Transformer iterations with CoT steps: Stage I trains a looped model to produce CoT-aligned intermediate outputs, and Stage II uses this model to generate high-quality CoT data for longer problems to fine-tune an auto-regressive CoT model. Empirically, looped Transformers exhibit superior length generalization, and RELAY significantly boosts AR-CoT performance on extended inputs while providing reliable intermediate reasoning. The approach offers a practical bridge between looped and auto-regressive architectures for robust, long-horizon reasoning.

Abstract

Chain-of-Thought (CoT) prompting has emerged as a powerful technique for enhancing language models' reasoning capabilities. However, generating long and correct CoT trajectories is challenging. Recent studies have demonstrated that Looped Transformers possess remarkable length generalization capabilities, but their limited generality and adaptability prevent them from serving as an alternative to auto-regressive solutions. To better leverage the strengths of Looped Transformers, we propose RELAY (REasoning through Loop Alignment iterativelY). Specifically, we align the steps of CoT reasoning with loop iterations and apply intermediate supervision during the training of Looped Transformers. This additional iteration-wise supervision not only preserves the Looped Transformer's ability for length generalization but also enables it to predict CoT reasoning steps for unseen data. We therefore use this Looped Transformer to generate accurate reasoning chains for complex problems that exceed the training length, and these chains are then used to fine-tune an auto-regressive model. We conduct extensive experiments, and the results demonstrate the effectiveness of our approach, with significant improvements in the performance of the auto-regressive model. Code will be released at https://github.com/qifanyu/RELAY.
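
The iteration-wise supervision described in the abstract can be illustrated with a minimal sketch. It assumes one plausible reading: the same block is applied once per CoT step, and the intermediate output of iteration t is scored against the tokens of CoT step t. The names `block`, `readout`, and `loop_aligned_loss` are hypothetical and are not taken from the paper or its code release. The averaging scheme is likewise an assumption.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]

def loop_aligned_loss(block, readout, state, cot_steps):
    """Average per-iteration loss when iteration t is supervised by CoT step t.

    block:     shared function applied once per loop iteration (hypothetical)
    readout:   maps a state to one logit vector per output position (hypothetical)
    cot_steps: list of token-id lists, one list per CoT step
    """
    total = 0.0
    for step_tokens in cot_steps:
        state = block(state)      # same weights reused at every iteration
        logits = readout(state)   # intermediate prediction for this step
        step_loss = sum(cross_entropy(l, t) for l, t in zip(logits, step_tokens))
        total += step_loss / len(step_tokens)
    return total / len(cot_steps)

# Toy usage: an untrained readout with uniform logits over 4 tokens,
# two output positions per step, and two CoT steps.
uniform = lambda state: [[0.0] * 4 for _ in range(2)]
loss = loop_aligned_loss(lambda s: s, uniform, state=None, cot_steps=[[1, 2], [3, 0]])
```

With uniform logits every token has probability 1/4, so the toy loss equals log 4. A real implementation would replace `block` and `readout` with a weight-shared Transformer block and an output head.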

Paper Structure

This paper contains 21 sections, 11 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Visualization of Chain-of-Thought (CoT) and the looping process. As problem complexity increases, the number of reasoning tokens grows in the auto-regressive CoT model, whereas the number of loop-block iterations grows in the looped model.
  • Figure 2: Length generalization performance of looped Transformer versus auto-regressive CoT model on Arithmetic (train: $\leq15$, test: $[15, 25]$), Edit Distance (train: $\leq30$, test: $[30, 40]$), and Longest Increasing Subsequence (train: $\leq100$, test: $[100, 120]$).
  • Figure 3: Overview of the RELAY framework. Stage I (left): Training looped model with explicit CoT alignment, where each iteration of the looped model learns to predict corresponding Chain-of-Thought (CoT) steps. Stage II (right): Using the trained looped model to generate CoT chains for enhancing auto-regressive CoT models. The looped model generates high-quality CoT chains for complex problems (beyond training length), which are then used to fine-tune the auto-regressive model to improve its reasoning capabilities.
  • Figure 4: Performance comparison of different models on long reasoning problems across three tasks: Arithmetic, Edit Distance (ED), and Longest Increasing Subsequence (LIS).
  • Figure 5: Hit accuracy matrices for the LIS task with a problem length of 105 ($T = 11$ steps).
  • ...and 2 more figures
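
The Stage II data flow described for Figure 3 can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' pipeline: `toy_looped_cot` is a hypothetical stand-in for a trained looped model, the running-sum task is invented for the example and is not the paper's Arithmetic task, and the prompt and completion formats are arbitrary.

```python
def toy_looped_cot(digits):
    """Hypothetical stand-in for a trained looped model.

    Each loop iteration consumes one digit and emits one CoT step
    (the running sum so far); the last step is the final answer.
    """
    steps, running = [], 0
    for d in digits:  # one iteration per reasoning step
        running += d
        steps.append(str(running))
    return steps

def build_finetune_set(problems, generate_cot, train_max_len):
    """Attach generated CoT chains to problems longer than the training length."""
    examples = []
    for digits in problems:
        if len(digits) <= train_max_len:
            continue  # already covered by the original training data
        steps = generate_cot(digits)
        examples.append({
            "prompt": "+".join(map(str, digits)) + "=",
            "completion": " -> ".join(steps),
        })
    return examples

# Only the second problem exceeds the assumed training length of 3.
data = build_finetune_set([[1, 2], [3, 4, 5, 6]], toy_looped_cot, train_max_len=3)
```

The resulting prompt/completion pairs are what an auto-regressive model would be fine-tuned on. In this sketch the filter on `train_max_len` is what restricts generation to problems beyond the training length.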