Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

Jiahao Wang, Weiye Xu, Aijun Yang, Wengang Zhou, Lewei Lu, Houqiang Li, Xiaohua Wang, Jinguo Zhu

TL;DR

This work tackles unfaithful reasoning in outcome-reward RL for multimodal reasoning by introducing Self-Consistency Sampling (SCS), which combines truncation–resampling (TR) and visual perturbations (VP) to generate multiple reasoning trajectories and compute a differentiable consistency reward. Integrating SCS with RLOO, GRPO, and REINFORCE++ improves accuracy by up to 7.7 percentage points across six multimodal benchmarks and several model scales, while also increasing reasoning faithfulness by roughly 15%. The key contributions are the TR+VP-based consistency mechanism, a formalized consistency reward, and extensive ablations showing that each component is necessary and robust. The approach is a simple, general, and computationally cheap remedy for improving reasoning fidelity in outcome-reward RL for MLLMs.

Abstract

Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting, a dominant format for multimodal reasoning benchmarks, the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. With Qwen2.5-VL-7B-Instruct as the base model, plugging SCS into the RLOO, GRPO, and REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on both Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.
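
The two sampling steps described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the names `truncate`, `agreement`, `consistency_score`, `perturb`, and `resample` are hypothetical, the image is a toy list of floats, and the score is a plain majority-agreement fraction, which stands in for (and is simpler than) the differentiable consistency score the authors describe.

```python
from collections import Counter
from typing import Callable, List, Sequence

Image = List[float]  # toy stand-in for pixel data (assumption for illustration)


def truncate(steps: Sequence[str], ratio: float) -> List[str]:
    """Keep the first `ratio` fraction of the reasoning steps."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    keep = max(1, int(len(steps) * ratio))
    return list(steps[:keep])


def agreement(answers: Sequence[str]) -> float:
    """Fraction of answers that match the most frequent answer."""
    if not answers:
        raise ValueError("need at least one answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def consistency_score(
    image: Image,
    steps: Sequence[str],
    perturb: Callable[[Image], Image],
    resample: Callable[[Image, List[str]], str],
    ratio: float = 0.5,
    n_resamples: int = 4,
) -> float:
    """Truncate the initial trajectory, then resample final answers.

    Each resample sees a freshly perturbed image and the shared prefix.
    `perturb` and `resample` are hypothetical callables standing in for
    the visual perturbation and the policy model, respectively.
    """
    prefix = truncate(steps, ratio)
    answers = [resample(perturb(image), prefix) for _ in range(n_resamples)]
    return agreement(answers)


if __name__ == "__main__":
    import itertools

    steps = ["read the chart", "compare the bars", "pick the tallest", "answer B"]
    identity = lambda img: img  # no-op perturbation for the demo
    guesses = itertools.cycle(["A", "B"])  # a policy that guesses inconsistently

    print(consistency_score([0.0], steps, identity, lambda img, p: "B"))
    print(consistency_score([0.0], steps, identity, lambda img, p: next(guesses)))
```

In this toy setup a policy whose continuations always reach the same option scores 1.0, while one that alternates between two options scores 0.5, so a lucky guess after a faulty prefix would earn a lower consistency score than stable reasoning. How that score is combined with the accuracy reward is left out here, since the abstract does not specify it.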

Paper Structure

This paper contains 31 sections, 14 equations, 22 figures, 14 tables, 1 algorithm.

Figures (22)

  • Figure 1: Overview of our work. When applied to multiple-choice problems, traditional outcome reward-based RL methods that rely solely on accuracy-based rewards often lead to situations where the selected option is correct, but the reasoning process is flawed. Our method introduces an additional consistency reward, which significantly reduces the occurrence of such cases.
  • Figure 2: Illustration of the unfaithful reasoning phenomenon. (a) Compared with open-ended questions, training in the multi-choice format yields smaller performance gains. (b) Examples of unfaithful reasoning generated by the model on multi-choice QA problems. (c) The average number of distinct final options per question as a function of the trajectory truncation ratio (%). (d) Correct reasoning trajectories generated by models in the open-ended QA format.
  • Figure 3: Pipeline of our method. (a) illustrates the initial reasoning trajectory generated by the MLLM; (b) shows the sampling and probability propagation across reasoning steps.
  • Figure 4: Comparison of responses from different models. (a) Question image selected from M3CoT. (b) Reasoning trajectory of Qwen2.5-VL-Instruct trained with RLOO as the baseline. (c) Reasoning trajectory of Qwen2.5-VL-Instruct trained with SCS. (d) Reasoning trajectory of Qwen2.5-VL-Instruct trained with SFT. Red text marks the incorrect part of the reasoning.
  • Figure 5: Hyper-parameter sensitivity ablation. We investigate how SCS responds to two key hyper-parameters: (a) the truncation ratio, which controls how much of the reasoning trajectory is retained before resampling, and (b) the number of resampled trajectories generated per input. Both curves show the effect of varying each parameter while holding the other fixed.
  • ...and 17 more figures