Thickening-to-Thinning: Reward Shaping via Human-Inspired Learning Dynamics for LLM Reasoning

Wenze Lin, Zhen Yang, Xitai Jiang, Pony Ma, Gao Huang

TL;DR

The paper addresses training large language models to reason under reinforcement learning with verifiable rewards (RLVR), where existing objectives often entangle exploration and consolidation. It introduces Thickening-to-Thinning (T2T), a lightweight, human-inspired reward shaping scheme that separates two regimes: longer, exploratory reasoning on problems the model has not yet solved (thickening) and concise, correct solutions once mastery is achieved (thinning). The method defines a competence-conditioned, length-aware reward that uses on-policy estimates of the model's current success probability, with a quadratic dependence on that probability, to modulate trajectory length: it encourages exploration when the model is uncertain and promotes efficiency once answers are correct. Empirical results on math benchmarks (MATH-500, AIME, AMC) across Qwen and DeepSeek model variants show that T2T achieves competitive or superior performance, maintains higher policy entropy during training, and gives robust gains, particularly on larger models; ablations confirm that both the thickening and thinning components are necessary. The work shows that mimicking human learning dynamics (expand, then compress) can improve reasoning under finite compute, and it offers a practical enhancement that integrates easily into RLVR pipelines, with potential impact on verifiable problem solving in LLMs.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for enhancing reasoning in Large Language Models (LLMs). However, it frequently encounters challenges such as entropy collapse, excessive verbosity, and insufficient exploration for hard problems. Crucially, existing reward schemes fail to distinguish between the need for extensive search during problem-solving and the efficiency required for mastered knowledge. In this work, we introduce T2T (Thickening-to-Thinning), a dynamic reward framework inspired by human learning processes. Specifically, it implements a dual-phase mechanism: (1) on incorrect attempts, T2T incentivizes "thickening" (longer trajectories) to broaden the search space and explore novel solution paths; (2) upon achieving correctness, it shifts to "thinning", imposing length penalties to discourage redundancy, thereby fostering model confidence and crystallizing reasoning capabilities. Extensive experiments on mathematical benchmarks (MATH-500, AIME, AMC) across Qwen-series and DeepSeek models demonstrate that T2T significantly outperforms standard GRPO and recent baselines.
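
The dual-phase mechanism described above can be illustrated with a minimal Python sketch. This is not the paper's reward formula, which is not reproduced in this summary. The function name `t2t_shaped_rewards`, the coefficient `alpha`, the min-max length normalization, and the specific quadratic weights `(1 - p_hat)**2` and `p_hat**2` are all assumptions chosen only to show the shape of the idea: a success rate estimated from a group of on-policy samples scales a length bonus for incorrect responses and a length penalty for correct ones.

```python
from typing import List


def t2t_shaped_rewards(
    correct: List[bool],
    lengths: List[int],
    alpha: float = 0.5,  # hypothetical shaping strength, not a value from the paper
) -> List[float]:
    """Illustrative length-aware reward for one group of sampled responses.

    Assumed form (not taken from the paper):
      incorrect response: alpha * (1 - p_hat)^2 * rel_len        ("thickening")
      correct response:   1 - alpha * p_hat^2 * rel_len          ("thinning")
    where p_hat is the fraction of correct responses in the group and
    rel_len is the response length min-max normalized to [0, 1].
    """
    if not correct or len(correct) != len(lengths):
        raise ValueError("correct and lengths must be non-empty and equal in size")
    if not 0.0 <= alpha <= 0.5:
        # With alpha <= 0.5, a correct response never scores below an incorrect one.
        raise ValueError("alpha must lie in [0, 0.5]")

    # On-policy estimate of the model's current success probability on this prompt.
    p_hat = sum(correct) / len(correct)

    min_len, max_len = min(lengths), max(lengths)
    span = max_len - min_len

    rewards = []
    for ok, length in zip(correct, lengths):
        # Relative length within the group; 0.0 when all lengths are equal.
        rel_len = (length - min_len) / span if span > 0 else 0.0
        if ok:
            # Thinning: the length penalty grows as the model masters the prompt.
            rewards.append(1.0 - alpha * (p_hat ** 2) * rel_len)
        else:
            # Thickening: the length bonus is largest when the model rarely succeeds.
            rewards.append(alpha * ((1.0 - p_hat) ** 2) * rel_len)
    return rewards


if __name__ == "__main__":
    # Two correct and two incorrect samples for the same prompt.
    print(t2t_shaped_rewards([True, True, False, False], [100, 200, 100, 300]))
```

Under these assumptions, the shorter of two correct responses scores higher, the longer of two incorrect responses scores higher, and every correct response still outranks every incorrect one. The shaped rewards could then be passed to a group-based advantage estimator such as the one used in GRPO.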

Paper Structure (58 sections, 60 equations, 11 figures, 11 tables)


Figures (11)

  • Figure 1: A two-stage learning pattern in human learning, where understanding is first expanded through exploration and later refined into concise and efficient knowledge.
  • Figure 2: (1) On incorrect attempts, T2T incentivizes "thickening" to broaden the search space; (2) Upon correctness, it shifts to "thinning" to discourage redundancy, fostering model confidence.
  • Figure 3: Training Accuracy Evolution. The plot on the left corresponds to Qwen2.5-3B, and the plot on the right corresponds to Qwen3-4B. Across both model scales, T2T demonstrates superior learning efficiency compared to the baseline.
  • Figure 4: Policy Entropy Evolution. The plot on the left corresponds to Qwen2.5-3B, and the plot on the right corresponds to Qwen3-4B. Regardless of the absolute trend, T2T consistently maintains a higher relative entropy level than the baseline, indicating sustained exploration capabilities.
  • Figure 5: Response Length Evolution. The plot on the left corresponds to Qwen2.5-3B, and the plot on the right corresponds to Qwen3-4B.
  • ...and 6 more figures