ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

Difan Jiao, Qianfeng Wen, Blair Yang, Zhenwei Tang, Ashton Anderson

Abstract

We introduce ThinkTwice, a simple two-phase framework, built on Group Relative Policy Optimization (GRPO), that jointly optimizes LLMs to solve reasoning problems and to refine their own answers. In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then on refining its own solutions to the same problems, using the same binary correctness reward in both phases without additional correctness signals or critique annotations. Across five mathematical reasoning benchmarks and two model families (Qwen3-4B and Olmo3-7B), ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. On Qwen3-4B, for example, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of ThinkTwice's training dynamics reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding a more rectified reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for reinforcement learning with verifiable rewards (RLVR).
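
To make the training recipe concrete, below is a minimal sketch of the alternating solve-then-refine loop the abstract describes. It is reconstructed from the abstract alone, not from the authors' code: the helper names (`sample_solutions`, `refine_solution`, `is_correct`, `policy_update`) are hypothetical placeholders, and the stubs merely stand in for real policy rollouts, a verifier, and a clipped policy-gradient update.

```python
import random
from typing import List

def group_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style group-relative advantage: each rollout's reward is
    centered on the group mean and scaled by the group std deviation."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0.0:
        return [0.0] * len(rewards)  # all rollouts tied: no learning signal
    return [(r - mean) / std for r in rewards]

# --- stubs standing in for the real model and verifier --------------------
def sample_solutions(problem: str, k: int) -> List[str]:
    return [f"solution-{i}" for i in range(k)]       # first-pass rollouts

def refine_solution(problem: str, solution: str) -> str:
    return solution + "-refined"                     # second-pass rollout

def is_correct(problem: str, answer: str) -> float:
    return float(random.random() < 0.5)              # binary verifier reward

def policy_update(samples: List[str], advantages: List[float]) -> None:
    pass                                             # clipped PG step here

# --- one ThinkTwice training pair: a solve step, then a refine step -------
def thinktwice_step(problems: List[str], group_size: int = 4) -> None:
    for problem in problems:
        # Phase 1: optimize the model on solving the problem from scratch.
        solutions = sample_solutions(problem, group_size)
        rewards = [is_correct(problem, s) for s in solutions]
        policy_update(solutions, group_advantages(rewards))

        # Phase 2: optimize the same model on refining its own solutions,
        # scored with the same binary correctness reward.
        refinements = [refine_solution(problem, s) for s in solutions]
        rewards = [is_correct(problem, r) for r in refinements]
        policy_update(refinements, group_advantages(rewards))

thinktwice_step(["What is 6 * 7?"])
```

Note the design choice this sketch highlights: both phases update one shared backbone with the same group-relative baseline, so refinement needs no learned value function and no annotation of which base solutions were correct.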

Paper Structure

This paper contains 37 sections, 31 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: (A) Prompt-only reflection can reduce top frontier LLMs' performance on AIME24, indicating brittleness. (B) ThinkTwice compared with existing method families. (C) ThinkTwice addresses these gaps by sequentially training a shared model backbone—first solving, then reflecting—yielding significant gains (+5 points reasoning, +11 points refinement) on AIME with Qwen3-4B.
  • Figure 2: ThinkTwice at a glance.
  • Figure 3: Cross-model refinement evaluation (average pass@4, $\uparrow$). Rows denote the backbone reasoning model; columns denote the refinement model.
  • Figure 4: Training dynamics of refinement across checkpoints. The vertical dashed lines mark the best checkpoints. Left (a): transition metrics on the training set. Right (b): formatting and length metrics during training. Top: boxed-answer and final-answer marker rates; bottom: average response length for self-refinement on correct-only base solutions.
  • Figure 5: Training-time cost and dynamics of ThinkTwice compared with GRPO. * denotes the best checkpoint step for each model. (a) Mean reward. (b) Response length. (c) Wall-clock time per update. (d) Accumulated training time. (e) Within-checkpoint macro-average benchmark accuracy. Solid orange denotes ThinkTwice base updates, solid blue denotes GRPO, and dashed orange denotes ThinkTwice refinement updates when applicable.
  • ...and 2 more figures