Reverse Thinking Makes LLMs Stronger Reasoners

Justin Chih-Yao Chen, Zifeng Wang, Hamid Palangi, Rujun Han, Sayna Ebrahimi, Long Le, Vincent Perot, Swaroop Mishra, Mohit Bansal, Chen-Yu Lee, Tomas Pfister

TL;DR

RevThink introduces reverse thinking for LLMs by augmenting training data with forward reasoning, backward questions, and backward reasoning generated by a teacher. A three-objective multi-task learning regime trains a smaller student to perform forward reasoning while internalizing backward reasoning capabilities, keeping test-time cost equivalent to zero-shot inference. Across 12 diverse datasets, RevThink yields substantial gains over zero-shot and common distillation baselines, demonstrates sample efficiency, and shows strong generalization to out-of-distribution tasks. The approach scales with model size and complements existing data augmentation methods, offering a practical path to more reliable and versatile reasoning in LLMs.

Abstract

Reverse thinking plays a crucial role in human reasoning. Humans can reason not only from a problem to a solution but also in reverse, i.e., start from the solution and reason towards the problem. This often enhances overall reasoning performance as it enables consistency checks between their forward and backward thinking. To enable Large Language Models (LLMs) to perform reverse thinking, we introduce Reverse-Enhanced Thinking (RevThink), a framework composed of data augmentation and learning objectives. In RevThink, we augment the dataset by collecting structured forward-backward reasoning from a teacher model, consisting of: (1) the original question, (2) forward reasoning, (3) backward question, and (4) backward reasoning. We then employ three objectives to train a smaller student model in a multi-task learning fashion: (a) generate forward reasoning from a question, (b) generate a backward question from a question, and (c) generate backward reasoning from the backward question. Experiments across 12 datasets covering commonsense, math, and logical reasoning show an average 13.53% improvement over the student model's zero-shot performance and a 6.84% improvement over the strongest knowledge distillation baselines. Moreover, our method demonstrates sample efficiency -- using only 10% of the correct forward reasoning from the training data, it outperforms a standard fine-tuning method trained on 10x more forward reasoning. RevThink also exhibits strong generalization to out-of-distribution held-out datasets.
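The three objectives (a) to (c) can be read as three input/target pairs built from each augmented instance. The following is a minimal sketch under that reading, not the authors' implementation: the `AugmentedExample` class, the `to_training_pairs` helper, and the instruction prefixes are all hypothetical, since the abstract does not specify prompt wording or data layout.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AugmentedExample:
    """One teacher-augmented instance (field names are assumptions)."""
    question: str            # (1) original question Q
    forward_reasoning: str   # (2) forward reasoning R_f
    backward_question: str   # (3) backward question Q_b
    backward_reasoning: str  # (4) backward reasoning R_b


# Hypothetical instruction prefixes; the source does not give the prompts.
ANSWER_PREFIX = "Answer the question step by step: "
INVERT_PREFIX = "Write the backward question for: "


def to_training_pairs(ex: AugmentedExample) -> List[Tuple[str, str]]:
    """Expand one instance into three (input, target) pairs."""
    return [
        # (a) generate forward reasoning from the question: Q -> R_f
        (ANSWER_PREFIX + ex.question, ex.forward_reasoning),
        # (b) generate the backward question from the question: Q -> Q_b
        (INVERT_PREFIX + ex.question, ex.backward_question),
        # (c) generate backward reasoning from the backward question: Q_b -> R_b
        (ANSWER_PREFIX + ex.backward_question, ex.backward_reasoning),
    ]
```

Under this sketch, multi-task training amounts to mixing all three pairs from every instance into one fine-tuning set, while inference uses only the input format of pair (a).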

Paper Structure

This paper contains 18 sections, 1 equation, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Comparison between symbolic knowledge distillation (SKD) and our method. (1) the teacher model generates multiple reasoning chains for a given question, (2) SKD supervised fine-tunes on the correct reasoning chains, and (3) our method incorporates bidirectional reasoning, learning from both Q-to-A and A-to-Q using our multi-task objectives.
  • Figure 2: RevThink consists of two stages: (1) data augmentation and (2) student model learning. First, given a dataset $\mathcal{D} = \{(Q^{(i)}, A^{(i)})\}_{i=1}^n$, we augment it by prompting the teacher model to generate forward reasoning, a backward question, and backward reasoning. We keep only instances with correct forward reasoning (validated against the ground truth) and consistent forward-backward reasoning (validated by the teacher model). This yields an augmented dataset $\mathcal{D}_\text{aug} = \{(Q^{(i)}, R^{(i)}_f, Q^{(i)}_b, R^{(i)}_b)\}_{i=1}^n$. Next, we train the student model with three objectives: $Q \rightarrow R_f$, $Q \rightarrow Q_b$, and $Q_b \rightarrow R_b$, enabling the student to reason in both directions during training. At test time, the student model performs only forward reasoning, making test-time compute as efficient as zero-shot prompting.
  • Figure 3: Comparison of RevThink and the SFT baseline with different sample sizes. Notably, RevThink is more sample-efficient, outperforming SFT by a large margin at every portion of the training data. Furthermore, our method with only 10% of the training data outperforms SFT with the full training data on StrategyQA.
  • Figure 4: Comparison of different learning sources and combinations of input/output. We denote $X \rightarrow Y$ as "given $X$ as the input to generate $Y$". Also, we use $\&$ to denote simultaneous learning from different combinations. RevThink's learning to generate forward reasoning, backward questions, and backward reasoning is the most effective, while learning only to generate backward reasoning is the least effective.
  • Figure 5: The average token count per sample used in training versus the test-time accuracy. The dashed line shows the regression over the baselines. Our method outperforms the baselines with only a slight increase in token count. Note that at test time RevThink generates a number of tokens comparable to that of all baselines.
  • ...and 2 more figures
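
The filtering step described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the dictionary keys, the `extract_final_answer` heuristic (which assumes reasoning ends with "The answer is ..."), and the `is_consistent` callable standing in for the teacher model's consistency check are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]

# Fields kept in the augmented dataset: (Q, R_f, Q_b, R_b).
AUG_FIELDS = (
    "question",
    "forward_reasoning",
    "backward_question",
    "backward_reasoning",
)


def extract_final_answer(reasoning: str) -> str:
    """Toy answer parser; assumes the text ends with 'The answer is X.'"""
    _, found, tail = reasoning.rpartition("The answer is")
    return tail.strip().rstrip(".") if found else ""


def filter_augmented(
    candidates: Iterable[Record],
    extract_answer: Callable[[str], str],
    is_consistent: Callable[[Record], bool],
) -> List[Record]:
    """Keep instances that pass both checks named in the caption."""
    kept: List[Record] = []
    for cand in candidates:
        # Check 1: forward reasoning must reach the ground-truth answer.
        if extract_answer(cand["forward_reasoning"]) != cand["answer"]:
            continue
        # Check 2: forward and backward reasoning must be consistent.
        # In the paper this is judged by the teacher model; here it is
        # an arbitrary callable supplied by the caller.
        if not is_consistent(cand):
            continue
        kept.append({field: cand[field] for field in AUG_FIELDS})
    return kept
```

Passing the consistency check as a callable keeps the sketch independent of any particular teacher model or API.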