GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training

Hongcheng Wang, Yinuo Huang, Sukai Wang, Guanghui Ren, Hao Dong

TL;DR

GRPO-MA tackles three core GRPO challenges by introducing multi-answer generation per thought, reducing variance in thought advantages and decoupling thought–answer gradients. The method is theoretically grounded via the delta method, showing that increasing the number of answers per thought ($M$) lowers variance and stabilizes training, while increasing the number of thoughts ($K$) has a more limited effect. Empirically, GRPO-MA improves performance and training efficiency across math, code, and multimodal tasks, and demonstrates strong robustness in sparse-reward simulator tasks, with ablations highlighting the value of higher $M$ and high-quality thoughts. The approach remains compatible with existing stability and efficiency enhancements and offers practical gains for CoT reinforcement learning in diverse domains.
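
The claim that a larger $M$ lowers variance can be illustrated with a small Monte Carlo sketch. It is not the paper's delta-method derivation. It assumes binary (0/1) rewards and a fixed per-thought success probability `p`, and the function names are hypothetical. Under those assumptions the mean of $M$ answer rewards has variance $p(1-p)/M$, which the simulation reproduces approximately.

```python
import random
import statistics


def thought_value_estimate(p, m, rng):
    """Average of m Bernoulli(p) rewards: one noisy estimate of a thought's value.

    Assumption: each answer sampled from the thought succeeds independently
    with probability p and receives reward 1, otherwise reward 0.
    """
    return sum(1.0 if rng.random() < p else 0.0 for _ in range(m)) / m


def estimate_variance(p, m, trials=20000, seed=0):
    """Empirical variance of the thought-value estimate over many repetitions."""
    rng = random.Random(seed)
    samples = [thought_value_estimate(p, m, rng) for _ in range(trials)]
    return statistics.pvariance(samples)


if __name__ == "__main__":
    p = 0.3  # hypothetical success probability of a single thought
    for m in (1, 2, 4, 8):
        empirical = estimate_variance(p, m)
        theoretical = p * (1 - p) / m
        print(f"M={m}: empirical={empirical:.4f}, p(1-p)/M={theoretical:.4f}")
```

With $M=1$ (the plain GRPO setting of one answer per thought) the estimate is a single coin flip. Each doubling of $M$ roughly halves its variance in this toy model.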

Abstract

Recent progress, such as DeepSeek-R1, has shown that the GRPO algorithm, a Reinforcement Learning (RL) approach, can effectively train Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) and Vision-Language Models (VLMs). In this paper, we analyze three challenges of GRPO: gradient coupling between thoughts and answers, sparse reward signals caused by limited parallel sampling, and unstable advantage estimation. To mitigate these challenges, we propose GRPO-MA, a simple yet theoretically grounded method that leverages multi-answer generation from each thought process, enabling more robust and efficient optimization. Theoretically, we show that the variance of thought advantage decreases as the number of answers per thought increases. Empirically, our gradient analysis confirms this effect, showing that GRPO-MA reduces gradient spikes compared to GRPO. Experiments on math, code, and diverse multimodal tasks demonstrate that GRPO-MA substantially improves performance and training efficiency. Our ablation studies further reveal that increasing the number of answers per thought consistently enhances model performance.
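
The multi-answer idea can be sketched as follows, under stated assumptions. This excerpt does not give the exact estimator, so the code assumes a thought's value is the mean reward of its $M$ answers and that both thought and answer advantages are standardized within the group, as in GRPO. The function names and the `eps` constant are hypothetical.

```python
import statistics


def normalize(values, eps=1e-6):
    """Group-standardize values: subtract the mean, divide by the std (plus eps)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


def grpo_ma_advantages(rewards):
    """Compute thought-level and answer-level advantages.

    rewards[i][j] is the reward of answer j sampled from thought i,
    i.e. a K x M table (K thoughts, M answers per thought).
    Assumption: thought value = mean reward over that thought's answers.
    """
    m = len(rewards[0])
    # Thought advantage: standardize the per-thought mean rewards across K thoughts.
    thought_values = [statistics.fmean(row) for row in rewards]
    thought_adv = normalize(thought_values)
    # Answer advantage: standardize each reward across all K * M answers.
    flat_adv = normalize([r for row in rewards for r in row])
    answer_adv = [flat_adv[i * m:(i + 1) * m] for i in range(len(rewards))]
    return thought_adv, answer_adv


if __name__ == "__main__":
    # K = 4 thoughts, M = 4 answers each, binary rewards (illustrative data only).
    rewards = [
        [1, 1, 0, 1],
        [0, 0, 0, 1],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    thought_adv, answer_adv = grpo_ma_advantages(rewards)
    print("thought advantages:", [round(a, 3) for a in thought_adv])
    print("answer advantages:", [[round(a, 3) for a in row] for row in answer_adv])
```

In this sketch, setting $M=1$ makes each thought's value equal to its single answer's reward, so the two advantages coincide. That corresponds to the coupling between thought and answer estimates that the abstract attributes to GRPO.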

Paper Structure

This paper contains 52 sections, 36 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: The operational flow of advantage estimation in GRPO and GRPO-MA. In the baseline GRPO framework (top), the advantage is computed from a single thought–answer pair, inherently coupling the estimation of thought and answer advantages to a single reward signal. In contrast, GRPO-MA (bottom) extends this setting by sampling multiple answers for each thought. This design decouples the estimation of thought and answer advantages and leverages aggregated information from multiple reward signals, thereby yielding richer supervision and enabling more robust and stable estimation of thought-level advantages.
  • Figure 2: A case study comparing the baseline GRPO with our proposed GRPO-MA on a referring expression grounding task. The prompt is to locate the "purple bottled beverage". The baseline model, GRPO (T4A1), recognizes the target's existence but its reasoning is distracted by other salient objects (the snacks), leading to a failure in grounding. In contrast, our GRPO-MA (T4A4) correctly reasons about the scene's context, focuses on the target object held by the robotic arm, and successfully provides the precise bounding box. This demonstrates the superior robustness of GRPO-MA in complex scene understanding and reasoning.
  • Figure 3: Ablation Study on Trajectory Prediction. With the number of thoughts fixed at $K=4$, we gradually increase the number of responses $M$ per thought from 1 to 8, so the total number of responses $K \times M$ is 4, 8, 12, ..., 32.
  • Figure 4: Case Study on Object Detection. Green text indicates key reasoning content.
  • Figure 5: Case Study on Trajectory Prediction. Green text indicates key reasoning content.
  • ...and 7 more figures