GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

Yi Chen; Yuying Ge; Rui Wang; Yixiao Ge; Junhao Cheng; Ying Shan; Xihui Liu

TL;DR

This work introduces SEED-Bench-R1, a three-level benchmark for multimodal video understanding designed to rigorously evaluate post-training RL methods. The benchmark exposes a gap: outcome-only GRPO improves answer accuracy but harms the consistency between reasoning and answer. The work then proposes GRPO-CARE, a consistency-aware RL framework that drops the KL penalty and instead adds an adaptive consistency bonus computed with a slowly updated reference model, improving both correctness and interpretability. Empirically, GRPO-CARE outperforms GRPO on SEED-Bench-R1 (notably +6.7% on Level-3 and +24.5% in consistency) and transfers robustly to a suite of general video understanding benchmarks. The approach offers a practical post-training strategy for cultivating more coherent and grounded multimodal reasoning in MLLMs.

Abstract

Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) is unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from reward signals focusing solely on final answers, encouraging shortcuts, and strict KL penalties limiting exploration. To address this, we propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model's reasoning-to-answer likelihood (via a slowly-evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. Replacing KL penalties with this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the hardest evaluation level and a 24.5% improvement in consistency. It also shows strong transferability, improving model performance across diverse video understanding benchmarks. Our work contributes a systematically designed benchmark and a generalizable post-training framework, advancing the development of more interpretable and robust MLLMs.
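
The two-tiered reward described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the bonus size, the peer-comparison rule, or how the reference model is updated. The sketch assumes a fixed bonus of 0.5, a threshold equal to the mean reference likelihood among correct responses in the group, and an exponential-moving-average (EMA) update for the reference parameters. All function and parameter names are hypothetical. Only the group-normalized advantage follows the standard GRPO formulation.

```python
from statistics import mean, pstdev


def consistency_aware_rewards(correct, ref_likelihoods, bonus=0.5):
    """Two-tiered reward for one group of sampled responses.

    correct: list of bools, whether each response's final answer is right.
    ref_likelihoods: list of floats, the reference model's likelihood of each
        response's answer given its own reasoning trace.
    bonus: ASSUMED bonus magnitude; the abstract does not give a value.
    """
    if len(correct) != len(ref_likelihoods):
        raise ValueError("correct and ref_likelihoods must have equal length")

    # Tier 1: base reward for answer correctness.
    rewards = [1.0 if c else 0.0 for c in correct]

    # Tier 2: consistency bonus, granted only to correct responses whose
    # reasoning-to-answer likelihood is at least as high as their peers'.
    # ASSUMPTION: "peers" = correct responses, threshold = their mean.
    peer_scores = [p for c, p in zip(correct, ref_likelihoods) if c]
    if peer_scores:
        threshold = mean(peer_scores)
        for i, (c, p) in enumerate(zip(correct, ref_likelihoods)):
            if c and p >= threshold:
                rewards[i] += bonus
    return rewards


def group_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: normalize rewards within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def ema_update(ref_params, policy_params, decay=0.99):
    """ASSUMED slow reference update: EMA of the policy parameters."""
    return [decay * r + (1.0 - decay) * p
            for r, p in zip(ref_params, policy_params)]


if __name__ == "__main__":
    correct = [True, True, False, True]
    likelihoods = [0.9, 0.4, 0.95, 0.7]
    rewards = consistency_aware_rewards(correct, likelihoods)
    print(rewards)
    print(group_advantages(rewards))
```

In this toy group, the second response is correct but its reasoning supports its answer less strongly than its correct peers', so it receives only the base reward. The third response is wrong and receives no reward, however confident the reference model is in it.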
