Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

Jiahao Wang; Weiye Xu; Aijun Yang; Wengang Zhou; Lewei Lu; Houqiang Li; Xiaohua Wang; Jinguo Zhu

TL;DR

This work tackles unfaithful reasoning in outcome-reward RL for multimodal models by introducing Self-Consistency Sampling (SCS), which combines truncation-resampling (TR) and visual perturbations (VP) to generate multiple reasoning trajectories and compute a differentiable consistency reward. Integrating SCS with RLOO, GRPO, and REINFORCE++ improves accuracy by up to 7.7 percentage points across six multimodal benchmarks and several model scales, and increases reasoning faithfulness by roughly 15%. The key contributions are the TR+VP consistency mechanism, a formalized consistency reward, and extensive ablations showing that each component is necessary and robust. The approach is a simple, general, and computationally efficient remedy for improving reasoning fidelity in outcome-reward RL for MLLMs.

Abstract

Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting, a dominant format for multimodal reasoning benchmarks, the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. With Qwen2.5-VL-7B-Instruct as the base model, plugging SCS into RLOO, GRPO, and the REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on both Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.
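The truncate, resample, and agree loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not give the consistency formula, so the majority-agreement fraction, the multiplicative `shaped_reward`, and its `weight` parameter are all hypothetical choices. The `resample_fn` callable is a hypothetical stand-in for the policy, and it is assumed to apply any visual perturbation itself before continuing from a prefix. Only `rloo_advantages` follows a standard definition (the leave-one-out baseline).

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def truncate_and_resample(
    steps: Sequence[str],
    resample_fn: Callable[[Sequence[str]], str],
    num_resamples: int,
    rng: random.Random,
) -> List[str]:
    """Cut the initial trajectory at random points and collect the final
    answer produced by continuing from each prefix.

    `resample_fn` is a hypothetical stand-in for the policy model; it is
    assumed to perturb the image and return a final answer string.
    """
    answers = []
    for _ in range(num_resamples):
        # Keep at least one step, and drop at least one when possible.
        cut = rng.randint(1, max(1, len(steps) - 1))
        answers.append(resample_fn(steps[:cut]))
    return answers


def consistency_score(answers: Sequence[str]) -> float:
    """Fraction of resampled answers that agree with the majority answer.

    Assumption: a simple agreement ratio, not the paper's exact score.
    """
    if not answers:
        return 0.0
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / len(answers)


def shaped_reward(is_correct: bool, answers: Sequence[str], weight: float = 0.5) -> float:
    """Down-weight a correct outcome when its resampled answers disagree.

    Assumption: a multiplicative blend controlled by a hypothetical `weight`.
    """
    outcome = 1.0 if is_correct else 0.0
    return outcome * ((1.0 - weight) + weight * consistency_score(answers))


def rloo_advantages(rewards: Sequence[float]) -> List[float]:
    """Standard leave-one-out baseline: each reward minus the mean of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```

In this sketch, a trajectory whose resampled continuations all reach the same option keeps its full outcome reward, while a lucky guess whose continuations scatter across options is scaled down before the advantage is computed.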
