Thickening-to-Thinning: Reward Shaping via Human-Inspired Learning Dynamics for LLM Reasoning
Wenze Lin, Zhen Yang, Xitai Jiang, Pony Ma, Gao Huang
TL;DR
The paper addresses reinforcement learning with verifiable rewards (RLVR) for large language model reasoning, where existing objectives often entangle exploration and consolidation. It introduces Thickening-to-Thinning (T2T), a lightweight, human-inspired reward shaping scheme that separates longer, exploratory reasoning on difficult problems (thickening) from concise, correct solutions once mastery is achieved (thinning). The method defines a competence-conditioned, length-aware reward: it uses on-policy estimates of the model's current success probability, with a quadratic dependence on that probability, to modulate trajectory length, encouraging exploration when the model is uncertain and efficiency once it answers correctly. Empirical results on math benchmarks (MATH-500, AIME, AMC) across model families (Qwen and DeepSeek variants) show that T2T achieves competitive or superior performance, maintains higher policy entropy during training, and delivers robust gains, particularly on larger models; ablations confirm that both the thickening and thinning components are necessary. The work argues that mimicking human learning dynamics (expand, then compress) can improve reasoning capabilities under finite compute, and it offers a practical, integration-friendly enhancement to RLVR pipelines for verifiable problem solving in LLMs.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for enhancing reasoning in Large Language Models (LLMs). However, it frequently suffers from entropy collapse, excessive verbosity, and insufficient exploration on hard problems. Crucially, existing reward schemes fail to distinguish between the extensive search needed while a problem is still being solved and the efficiency appropriate for mastered knowledge. In this work, we introduce T2T (Thickening-to-Thinning), a dynamic reward framework inspired by human learning processes. Specifically, it implements a dual-phase mechanism: (1) on incorrect attempts, T2T incentivizes "thickening" (longer trajectories) to broaden the search space and explore novel solution paths; (2) upon achieving correctness, it shifts to "thinning", imposing length penalties to discourage redundancy, thereby fostering model confidence and crystallizing reasoning capabilities. Extensive experiments on mathematical benchmarks (MATH-500, AIME, AMC) across Qwen-series and DeepSeek models demonstrate that T2T significantly outperforms standard GRPO and recent baselines.
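To make the dual-phase mechanism concrete, the following is a minimal sketch of what a competence-conditioned, length-aware reward of this kind might look like. It is not the paper's formula: the exact functional form, the function name `t2t_style_rewards`, the coefficient `alpha`, the normalization by `max_len`, and the choice of `(1 - p)^2` and `p^2` as the quadratic weights are all assumptions made for illustration. The only elements taken from the text are that the success probability is estimated on-policy (here, from a group of sampled rollouts for one prompt), that incorrect attempts are rewarded for length, that correct attempts are penalized for length, and that the dependence on success probability is quadratic.

```python
from typing import List


def t2t_style_rewards(
    correct: List[bool],
    lengths: List[int],
    max_len: int,
    alpha: float = 0.2,  # hypothetical shaping strength, not from the paper
) -> List[float]:
    """Illustrative length-aware reward for one group of rollouts on one prompt.

    Assumed form (not the paper's exact definition):
      incorrect rollout: alpha * (1 - p)^2 * rel_len      ("thickening" bonus)
      correct rollout:   1 - alpha * p^2 * rel_len        ("thinning" penalty)
    where p is the group's empirical success rate and rel_len is in [0, 1].
    """
    if not correct or len(correct) != len(lengths):
        raise ValueError("correct and lengths must be non-empty and equal in size")
    if max_len <= 0:
        raise ValueError("max_len must be positive")

    # On-policy estimate of the current success probability for this prompt.
    p = sum(correct) / len(correct)

    rewards = []
    for ok, n in zip(correct, lengths):
        rel_len = min(n, max_len) / max_len  # clip and normalize the length
        if ok:
            # Mastery (high p) makes extra length on correct answers costly.
            rewards.append(1.0 - alpha * p ** 2 * rel_len)
        else:
            # Low competence (low p) makes longer failed attempts more rewarded.
            rewards.append(alpha * (1.0 - p) ** 2 * rel_len)
    return rewards


if __name__ == "__main__":
    # Hard prompt: one success in four samples.
    print(t2t_style_rewards([True, False, False, False], [100, 400, 200, 800], max_len=800))
```

With `alpha` below 0.5, every correct rollout in this sketch still scores higher than every incorrect one, so the shaping term only reorders rollouts within each outcome class. In a GRPO-style pipeline, rewards like these would then be normalized within the group to form advantages; that step is omitted here.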
