Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation

Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian, Lijun Li

TL;DR

A two-stage framework for stable adaptive thinking in large reasoning models: Hybrid Fine-Tuning first exposes the model to both thinking and no-thinking behaviors to establish a well-conditioned initialization, and adaptive reinforcement learning with Correctness-Preserving Advantage Shaping then follows.

Abstract

Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries. Existing efforts to mitigate this issue are fundamentally limited by unstable accuracy-efficiency trade-offs and poor robustness to heterogeneous reasoning behaviors. To address these challenges, we propose a two-stage framework for stable adaptive thinking in LRMs. The framework first applies Hybrid Fine-Tuning to expose the model to both thinking and no-thinking behaviors, establishing well-conditioned initialization. It then performs adaptive reinforcement learning with Correctness-Preserving Advantage Shaping (CPAS) to avoid suppressing correct long-chain reasoning, and Length-Aware Gradient Regulation (LAGR) to stabilize optimization under severe reasoning-length heterogeneity. Extensive experiments on Qwen2.5-1.5B and 7B show consistent improvements over strong baselines, achieving up to +3.7/+3.6 accuracy points while reducing generated tokens by 40.6%/43.9%. Further analyses across varying problem difficulties and out-of-distribution tasks confirm the robustness and generalization of our approach.

Paper Structure

This paper contains 49 sections, 7 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Illustration of efficient reasoning methods.
  • Figure 2: Overview of our two-stage training pipeline. Stage 1 performs Hybrid Fine-Tuning (HFT) on paired thinking and no-thinking formats to initialize a unified policy. Stage 2 applies GRPO-style reinforcement learning with correctness-preserving advantage shaping (CPAS) and length-aware gradient regulation (LAGR) to stabilize optimization under extreme length heterogeneity and to learn when to think.
  • Figure 3: Difficulty-aware mode selection and performance. (a) Performance of the two thinking modes on MATH-500, AIME-2024, and AIME-2025. (b) Mode ratio across MATH-500 difficulty levels. (c) Accuracy across difficulty levels, comparing our adaptive policy with always-Thinking and always-No-Thinking baselines.
  • Figure 4: Ablation and sensitivity analysis. (a) Training dynamics with/without CPAS: mean response length (left) and AIME-2024 accuracy (right). (b) Effect of the LAGR length-weight parameter $\beta$ (left) and the control-token boost factor $\lambda$ (right) on accuracy and the no-thinking ratio.
  • Figure 5: For an example from MATH-500, the Thinking baseline generates a long chain-of-thought with redundant intermediate steps. In contrast, our method chooses NoThinking and directly produces a concise final solution, using 713 tokens in total.
  • ...and 2 more figures