Dissecting Long-Chain-of-Thought Reasoning Models: An Empirical Study
Yongyu Mu, Jiali Zeng, Bei Li, Xinyan Guan, Fandong Meng, Jie Zhou, Tong Xiao, Jingbo Zhu
TL;DR
The paper investigates the training dynamics of long-chain-of-thought reasoning models trained via scaling reinforcement learning (RL). It addresses three questions: how positive and negative samples shape learning, why Group Relative Policy Optimization (GRPO) is data-inefficient, and why evaluation results are unstable. Experiments use RL with verifiable rewards (RLVR) and GRPO on the DeepScaleR-40K math dataset, and introduce Neg-to-Pos training, Relative Length Reward (RLR), and Offline Sample Injection (OSI) to improve data efficiency and robustness. The authors find that positive samples primarily improve fitting to the training data, while negative samples boost generalization and robustness. Positive samples are essential for convergence in the zero-RL setting, whereas in cold-start settings training on negative samples alone already yields strong reasoning performance. They also show that zero advantage is a practical issue in GRPO and propose RLR and OSI as remedies. Finally, they attribute unstable evaluation results to uncertain problems with ambiguous outcomes and show that greedy decoding can distort evaluation by flipping the correctness of responses, which highlights the need for multiple evaluation runs. Overall, the work offers actionable strategies for building more data-efficient, robust long-CoT reasoning systems, while noting that model capacity can limit the benefit of offline data and that uncertain problems remain hard to evaluate.
Abstract
Despite recent progress in training long-chain-of-thought reasoning models via scaling reinforcement learning (RL), the underlying training dynamics remain poorly understood, and several counterintuitive behaviors persist. This work focuses on three key aspects: (1) We systematically analyze the roles of positive and negative samples in scaling RL, revealing that positive samples mainly facilitate precise fitting to the training data, whereas negative samples significantly enhance generalization and robustness. Interestingly, while positive samples are essential for convergence in the zero-RL setting, training on negative samples alone suffices to attain strong reasoning performance and even better generalization in cold-start scenarios. (2) We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage. To address this, we explore two strategies, relative length rewards and offline sample injection, to make better use of these samples and enhance reasoning efficiency and capability. (3) We investigate unstable performance across various reasoning models and benchmarks, attributing the instability to uncertain problems with ambiguous outcomes, and demonstrate that greedy decoding can distort evaluation by flipping the correctness of responses. Our code is available at: https://github.com/takagi97/Dissect-Long-Reason-Models.
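To make the zero-advantage issue in point (2) concrete, the sketch below assumes the commonly used GRPO formulation in which each response's advantage is its reward minus the group mean, divided by the group standard deviation. Under binary verifiable rewards, a group whose responses are all correct or all wrong then has zero advantage for every sample and contributes no policy gradient. The function names, the `eps` constant, and the toy reward groups are illustrative assumptions and are not taken from the authors' code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    Assumed standard GRPO-style normalization; eps is a hypothetical
    stabilizer that avoids division by zero for uniform groups.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def zero_advantage_fraction(groups, eps=1e-6, tol=1e-9):
    """Fraction of all sampled responses whose advantage is (numerically) zero."""
    total = sum(len(g) for g in groups)
    zero = sum(
        sum(1 for a in group_relative_advantages(g, eps) if abs(a) < tol)
        for g in groups
    )
    return zero / total


# Toy rollout groups with binary verifiable rewards (1 = correct, 0 = wrong).
groups = [
    [1.0, 1.0, 1.0, 1.0],  # all correct: every advantage is zero
    [0.0, 0.0, 0.0, 0.0],  # all wrong: every advantage is zero
    [1.0, 0.0, 0.0, 1.0],  # mixed: non-zero advantages
]

if __name__ == "__main__":
    for g in groups:
        print(g, group_relative_advantages(g))
    print("zero-advantage fraction:", zero_advantage_fraction(groups))
```

In this toy setup, two of the three groups are uniform, so two thirds of the samples carry no learning signal. Strategies such as relative length rewards or offline sample injection, as described in the abstract, aim to recover some use from such samples.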
