SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Chaofan Tao, Yangfan He, Mi Zhang, Shen Yan

TL;DR

SRPO targets explicit self-reflection in multimodal LLMs with a two-stage training pipeline: reflection-oriented SFT first instills reflection-enabled reasoning, and a reflection-aware, GRPO-based RL objective then reinforces it. Reflection-focused data for cold-start initialization, together with a reward design that favors concise, meaningful reflections, improves both reasoning accuracy and reflection quality across diverse multimodal benchmarks. On MathVista, MathVerse, MathVision, and MMMU-Pro, SRPO achieves state-of-the-art results among open-source models and is competitive with closed-source systems, with strong cross-domain generalization. The work shows that building self-reflection into both the supervised and reinforcement learning stages can extend a multimodal model's reasoning beyond the limits largely fixed during pre-training.

Abstract

Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.
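To make the second stage more concrete, the following is a minimal Python sketch of the group-relative advantage that GRPO-style methods compute over a group of sampled responses to the same prompt. The reward function `toy_reward`, its parameters `target_len` and `weight`, and the specific brevity bonus are hypothetical placeholders chosen for illustration; they are not the reward defined in the paper, which is only described here as encouraging concise, meaningful reflection.

```python
from statistics import mean, pstdev


def toy_reward(correct, reflection_tokens, target_len=100, weight=0.25):
    """Hypothetical reward: task accuracy plus a small bonus for short reflections.

    This is an illustrative assumption, NOT the paper's reward design.
    """
    task = 1.0 if correct else 0.0
    # Tokens beyond the (assumed) target length shrink the bonus linearly to zero.
    excess = max(0, reflection_tokens - target_len)
    brevity = weight * max(0.0, 1.0 - excess / target_len)
    return task + brevity


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four sampled responses: (answer correct?, reflection length in tokens).
group = [(True, 80), (True, 300), (False, 80), (False, 300)]
rewards = [toy_reward(c, n) for c, n in group]
advantages = group_relative_advantages(rewards)
```

Under these assumed settings, a correct response with a short reflection receives a larger advantage than a correct response with a long, padded reflection, which is the qualitative behavior the abstract attributes to the reflection-aware reward.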

Paper Structure

This paper contains 28 sections, 15 equations, 7 figures, and 10 tables.

Figures (7)

  • Figure 1: Left: Illustrative examples of reflection improving reasoning. Right: Quantitative comparison on benchmark datasets.
  • Figure 2: Pipeline of Self-Reflection SFT data construction, including CoT and self-reflection generation.
  • Figure 3: Generated samples in RL training (left) and generated samples in real test case (right).
  • Figure 4: Training curves for SRPO and baselines: (a) training reward, (b) response length, and (c) upper clipping ratio.
  • Figure 5: Performance of various RL methods with and without self-reflection.
  • ...and 2 more figures