Multi-Task GRPO: Reliable LLM Reasoning Across Tasks

Shyam Sundhar Ramesh, Xiaotong Ji, Matthieu Zimmer, Sangwoong Yoon, Zhiyong Wang, Haitham Bou Ammar, Aurelien Lucchi, Ilija Bogunovic

TL;DR

The paper addresses unreliable cross-task performance in RL post-training of LLMs when GRPO is applied with uniform task weights. It introduces MT-GRPO, a robustness-aware framework that jointly optimizes the policy and a dynamic, task-weighted sampling distribution, together with a ratio-preserving batch construction that keeps the gradient signal consistent with the adapted weights. The core contributions are: (i) improvement-aware task reweighting that balances emphasis on the weakest tasks against overall progress, (ii) a ratio-preserving sampler that aligns realized gradient contributions with the task weights, and (iii) a scalable algorithm validated on 3-task and 9-task reasoning benchmarks, where it substantially improves worst-task accuracy while maintaining competitive average accuracy. The results indicate that explicitly optimizing for task-wise robustness, combined with principled batch construction, yields more reliable multi-task reasoning and reaches a given worst-task accuracy in fewer training steps. This matters for deploying general-purpose LLMs as reliable reasoners across heterogeneous tasks.

Abstract

RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.
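
The zero-advantage issue and the idea of a ratio-preserving sampler can be illustrated with a small sketch. In GRPO, advantages are rewards normalized within a group of rollouts for the same prompt, so a group with identical rewards yields all-zero advantages and contributes no gradient. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the largest-remainder rounding, and the resample-until-filled strategy are all hypothetical choices, and `sample_prompt` and `rollout_rewards` are caller-supplied stand-ins for a real data loader and rollout engine.

```python
from statistics import pstdev


def group_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std, or all zeros if rewards are identical."""
    mean = sum(rewards) / len(rewards)
    std = pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def has_signal(rewards):
    """True if at least one rollout in the group has a non-zero advantage."""
    return any(a != 0.0 for a in group_advantages(rewards))


def target_counts(weights, batch_size):
    """Turn task weights (summing to 1) into integer per-task prompt counts.

    Uses largest-remainder rounding (an assumption made for this sketch).
    """
    raw = {task: w * batch_size for task, w in weights.items()}
    counts = {task: int(x) for task, x in raw.items()}
    leftover = batch_size - sum(counts.values())
    by_remainder = sorted(raw, key=lambda t: raw[t] - counts[t], reverse=True)
    for task in by_remainder[:leftover]:
        counts[task] += 1
    return counts


def ratio_preserving_batch(weights, batch_size, sample_prompt, rollout_rewards,
                           max_tries=1000):
    """Fill each task's quota with prompts whose rollout group has gradient signal.

    Zero-advantage prompts are discarded and resampled, so the share of
    informative prompts per task matches the target weights (hypothetical scheme).
    """
    batch = []
    for task, need in target_counts(weights, batch_size).items():
        kept = tries = 0
        while kept < need and tries < max_tries:
            tries += 1
            prompt = sample_prompt(task)
            rewards = rollout_rewards(task, prompt)
            if has_signal(rewards):
                batch.append((task, prompt, rewards))
                kept += 1
    return batch
```

With weights `{"a": 0.5, "b": 0.25, "c": 0.25}` and a batch size of 8, the quotas are 4, 2, and 2 informative prompts. A task with a high zero-gradient rate simply consumes more rollouts to fill its quota, which is the cost this kind of scheme pays for keeping the effective ratios intact.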

Paper Structure

This paper contains 29 sections, 49 equations, 13 figures, and 2 algorithms.

Figures (13)

  • Figure 1: GRPO assigns uniform task weights and samples without regard to task difficulty or zero-gradient rates. Consequently, easy tasks (Countdown) dominate while harder tasks (ARC, Zebra) lag, and effective gradient flow is skewed by varying zero-gradient rates ($\otimes$ marks high zero-gradient rates). In contrast, MT-GRPO adapts task weights to prioritize weaker tasks and uses a ratio-preserving sampler to align effective gradient contributions with target weights, substantially improving ARC and Zebra and yielding more balanced performance.
  • Figure 2: In strict worst-task optimization ($\varepsilon=0$), task weights rapidly collapse to the current worst task and oscillate as the worst task shifts, resulting in near-zero weighting of Countdown.
  • Figure 3: Improvement-aware Weight Update (IWU)
  • Figure 4: Ratios of zero-gradient prompts across tasks observed during training. ARC exhibits a substantially higher proportion of zero-gradient prompts than Zebra.
  • Figure 5: Experiment 1: MT-GRPO substantially outperforms all baselines in terms of worst-task accuracy by $6\%$ or more without conceding on average accuracy. Moreover, it achieves higher average per-task relative change, reflecting stronger improvements on weaker tasks.
  • ...and 8 more figures
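
The weight collapse described for Figure 2 can be reproduced qualitatively with a generic multiplicative-weights update that shifts mass toward the lowest-accuracy task. This is a sketch under stated assumptions and is not the paper's Improvement-aware Weight Update: the function `exp_weights_step`, the `step_size` parameter, and the uniform-mixing `floor` are hypothetical, and `floor` is not the paper's $\varepsilon$.

```python
import math


def exp_weights_step(weights, accuracies, step_size, floor=0.0):
    """One multiplicative-weights step that favors low-accuracy tasks.

    weights    : current task weights, summing to 1
    accuracies : current per-task accuracy in [0, 1]
    step_size  : how aggressively mass moves toward weak tasks
    floor      : fraction mixed with the uniform distribution (hypothetical
                 stabilizer; 0.0 means pure worst-task pressure)
    """
    scores = [w * math.exp(-step_size * acc) for w, acc in zip(weights, accuracies)]
    total = sum(scores)
    updated = [s / total for s in scores]
    k = len(updated)
    return [(1.0 - floor) * w + floor / k for w in updated]


# Three tasks; the first is easy (high accuracy), the second is currently the worst.
weights = [1 / 3, 1 / 3, 1 / 3]
accuracies = [0.9, 0.2, 0.3]
for _ in range(5):
    weights = exp_weights_step(weights, accuracies, step_size=50.0)
# Nearly all mass now sits on the worst task; the easy task's weight is close to zero.
```

With a large step size and no floor, the weights concentrate on whichever task is currently worst, so they jump whenever the identity of the worst task changes. Mixing in a uniform floor keeps every task above a minimum weight, which is one simple way to trade worst-task pressure against overall progress.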