GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, Xihui Liu

TL;DR

This work introduces SEED-Bench-R1, a three-level benchmark for multimodal video understanding designed to rigorously evaluate post-training RL methods, and uses it to show that outcome-only GRPO improves answer accuracy but harms the consistency between reasoning and answers. It then proposes GRPO-CARE, a consistency-aware RL framework that adds an adaptive consistency bonus computed with a slowly updated reference model and drops the KL penalty, improving both correctness and interpretability. Empirical results show that GRPO-CARE outperforms GRPO on SEED-Bench-R1 (notably +6.7% on Level-3 and +24.5% in consistency) and transfers to a suite of general video understanding benchmarks. The approach offers a practical post-training strategy for cultivating more coherent and grounded reasoning in multimodal LLMs.

Abstract

Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) is unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from reward signals focusing solely on final answers, encouraging shortcuts, and strict KL penalties limiting exploration. To address this, we propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model's reasoning-to-answer likelihood (via a slowly-evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. Replacing KL penalties with this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the hardest evaluation level and a 24.5% improvement in consistency. It also shows strong transferability, improving model performance across diverse video understanding benchmarks. Our work contributes a systematically designed benchmark and a generalizable post-training framework, advancing the development of more interpretable and robust MLLMs.
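
The two-tiered reward described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `two_tier_rewards`, the reward magnitudes, and the use of the mean likelihood among correct samples as the peer threshold are all assumptions made for illustration. The input `ref_likelihoods` stands in for the reasoning-to-answer likelihood that the paper obtains from the slowly evolving reference model.

```python
from typing import List


def two_tier_rewards(
    correct: List[bool],
    ref_likelihoods: List[float],
    base_reward: float = 1.0,        # assumed magnitude, not from the paper
    consistency_bonus: float = 0.5,  # assumed magnitude, not from the paper
) -> List[float]:
    """Return one total reward per sampled response in a group.

    correct[i]         -- whether response i gave the right final answer
    ref_likelihoods[i] -- reference-model likelihood of response i's answer
                          given its own reasoning (higher = more consistent)
    """
    # Only correct responses compete for the consistency bonus.
    peer_scores = [p for ok, p in zip(correct, ref_likelihoods) if ok]
    if not peer_scores:
        return [0.0] * len(correct)

    # Assumed peer comparison: beat the mean likelihood of correct peers.
    threshold = sum(peer_scores) / len(peer_scores)

    rewards = []
    for ok, p in zip(correct, ref_likelihoods):
        reward = base_reward if ok else 0.0
        if ok and p > threshold:
            reward += consistency_bonus
        rewards.append(reward)
    return rewards
```

In this sketch an incorrect response never receives the bonus, even when its reasoning-to-answer likelihood is high, so the bonus can only amplify rewards for responses that are already correct.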

Paper Structure

This paper contains 10 sections, 1 equation, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: (a) SEED-Bench-R1 (SB-R1) provides a systematic, three-level evaluation of post-training methods for MLLMs in video understanding, encompassing tasks that require both perception and reasoning to tackle complex real-world scenarios. (b) Our analysis identifies a key limitation of standard outcome-supervised GRPO: while it improves answer accuracy, it often compromises logical consistency between reasoning and answers. By introducing an adaptive, group-relative consistency bonus via reference-likelihood calibration, our GRPO-CARE achieves higher answer accuracy across all difficulty levels and improves interpretability, as reflected by increased consistency rates.
  • Figure 2: Case study of an L3 question from SEED-Bench-R1, showing a video of task progress, a final observation image, and attention maps (output-to-visual tokens). The SFT model tends to memorize reasoning patterns and exhibits perceptual hallucinations. The GRPO model attends more comprehensively to the highlighted key visual observation, but its generated content lacks logical consistency. The GRPO-CARE model better balances visual perception and logical reasoning.
  • Figure 3: GRPO-CARE uses a two-tier reward system: a base reward for answer correctness ($r^b_*$) and an adaptive consistency bonus ($r^c_*$). The consistency bonus is given to high-accuracy samples whose reasoning-to-answer likelihood—estimated by a slowly updated (EMA) reference model—is higher than that of their group peers, conditioned on the multimodal question. The total reward, the sum of base and consistency rewards, is then used to compute advantages for updating the online model.
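
The last two steps in the Figure 3 caption, the slowly updated (EMA) reference model and the conversion of total rewards into advantages, can be sketched as follows. This is a hedged illustration rather than the paper's code: parameters are represented as a plain dictionary of floats, the `decay` value is an assumption, and the advantage normalization (subtract the group mean, divide by the group standard deviation) follows the usual GRPO convention rather than anything stated in the caption.

```python
import statistics
from typing import Dict, List


def ema_update(
    ref_params: Dict[str, float],
    online_params: Dict[str, float],
    decay: float = 0.99,  # assumed value; closer to 1 means slower updates
) -> Dict[str, float]:
    """Move each reference parameter a small step toward the online model."""
    return {
        name: decay * ref_params[name] + (1.0 - decay) * online_params[name]
        for name in ref_params
    }


def group_advantages(total_rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize total rewards (base + consistency bonus) within one group."""
    mean = statistics.fmean(total_rewards)
    std = statistics.pstdev(total_rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in total_rewards]
```

For example, `group_advantages([1.5, 1.0, 0.0, 1.0])` ranks the correct and consistent response above the merely correct ones, and both above the incorrect one, which is the ordering the two-tier reward is meant to produce.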