Learning Self-Correction in Vision-Language Models via Rollout Augmentation

Yi Ding; Ziliang Qiu; Bolian Li; Ruqi Zhang

TL;DR

This work tackles the sparsity of self-correction signals in vision-language model reasoning by introducing Octopus, a rollout augmentation framework that recombines standard RL rollouts to generate dense, explicit self-correction demonstrations. A two-stage, response-masking training regime decouples self-correction from direct reasoning, mitigating conflicting signals and enabling joint improvement. Empirical results on seven benchmarks show Octopus-8B achieving state-of-the-art performance among open-source VLMs and substantially faster per-step training than baselines. The approach demonstrates that leveraging intrinsic contrastive signals within policy rollouts can yield efficient, controllable improvements in complex multimodal reasoning. Overall, Octopus provides a practical pathway to instill controllable self-correction in VLMs with reduced training cost and enhanced robustness.

Abstract

Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it: effective self-correction behaviors emerge only rarely, so the learning signal is extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation improves sample efficiency through rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves state-of-the-art performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 point while requiring only $0.72\times$ training time per step.
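The recombination idea in the abstract can be illustrated with a minimal sketch. It assumes each rollout splits at the <sc> token into a "before" and an "after" segment, that a verifier has already labeled each segment's answer as correct or not, and that a sample counts as positive when its post-<sc> answer is correct. The names `Segment`, `Rollout`, `recombine`, `balanced_subset`, and `demo_group` are hypothetical, and exhaustive pairing is only one possible pairing rule; the paper's actual selection procedure is not given in this summary.

```python
from dataclasses import dataclass
from itertools import product

SC_TOKEN = "<sc>"  # separator named in the source; everything else here is illustrative


@dataclass(frozen=True)
class Segment:
    text: str
    correct: bool  # verifier verdict on this segment's answer (assumed available)


@dataclass(frozen=True)
class Rollout:
    before: Segment  # response before <sc>
    after: Segment   # response after <sc>

    def render(self) -> str:
        return f"{self.before.text} {SC_TOKEN} {self.after.text}"


def recombine(group):
    """Pair every pre-<sc> segment with every post-<sc> segment of one prompt's group."""
    befores = [r.before for r in group]
    afters = [r.after for r in group]
    return [Rollout(b, a) for b, a in product(befores, afters)]


def count_effective(group):
    """Count wrong -> correct rollouts, i.e. effective self-corrections."""
    return sum(1 for r in group if not r.before.correct and r.after.correct)


def balanced_subset(group, k):
    """Keep up to k positives and the same number of negatives.

    Assumption: a sample is positive when its post-<sc> answer is correct.
    """
    positives = [r for r in group if r.after.correct]
    negatives = [r for r in group if not r.after.correct]
    n = min(k, len(positives), len(negatives))
    return positives[:n] + negatives[:n]


def demo_group():
    """Toy group: two rollouts stay wrong, two stay correct, none self-corrects."""
    wrong = [Rollout(Segment(f"w{i}", False), Segment(f"w{i}'", False)) for i in range(2)]
    right = [Rollout(Segment(f"c{i}", True), Segment(f"c{i}'", True)) for i in range(2)]
    return wrong + right
```

In this toy group no original rollout is an effective self-correction, yet pairing the wrong "before" segments with the correct "after" segments yields several. The counts are a property of the toy data and are not meant to reproduce the paper's numbers.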

Paper Structure (27 sections, 6 equations, 8 figures, 5 tables, 1 algorithm)


Introduction
Preliminaries
Reinforcement Learning with Verifiable Rewards
Definition of Self-Correction
Learning Self-Correction from Paired Rollouts
The Challenge: Self-Correction Signals Are Sparse
Correction-Specific Rollout Augmentation
Training Recipe
Cold-Start and Data Construction
Conflicts Between Direct Reasoning and Self-Correction in RL Training Objective
Response-Masking Strategy for Decoupled Learning
Experiments
Setup
Main Results
Ablation Study
...and 12 more sections

Figures (8)

Figure 1: Comparison of accuracy and training efficiency across different RL methods initialized on Qwen3-8B-VL-Instruct. Octopus achieves the best average accuracy across seven benchmarks while requiring substantially less rollout time.
Figure 2: The percentage of different correction behaviors during RL training with a self-correction–encouraging prompt.
Figure 3: Left: Octopus augmentation pairs responses before and after the <sc> token to explicitly construct effective self-correction examples (wrong $\rightarrow$ correct), increasing their count from 0 to 4. It also produces an equal number of positive and negative samples (4 each), balancing the advantage distribution within each training group. Right: Our two-stage RL pipeline. In Stage I, we decouple self-correction learning by applying masks and KL regularization to $o_1$. In Stage II, we selectively unmask $o_1$ only for samples with non-conflicting reward signals, while keeping it masked for the remaining samples.
Figure 4: Training dynamics of different methods. GSPO is initialized from the base $\pi_\theta$ and trained with standard RL. In-dis and Mixed Sampling are initialized from their corresponding SFT models and trained with the Octopus RL strategy introduced in § \ref{sec:train_rl}.
Figure 5: Teaching self-correction with binary and shaped rewards. (a) Reward curves before and after self-correction under a binary reward setting, showing limited self-correction learning. (b) Reward curves with the shaped reward defined in Eq. \ref{eq:hack_reward}, highlighting the emergence of reward hacking.
...and 3 more figures
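The two-stage masking described in the Figure 3 caption can be sketched as a per-token loss mask over the concatenation of $o_1$ and $o_2$. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the KL regularization on $o_1$ in Stage I is omitted, the <sc> token itself is ignored, and "non-conflicting reward signals" is interpreted here as $o_1$ and $o_2$ agreeing in correctness, which is an assumption the caption does not spell out.

```python
def response_mask(len_o1, len_o2, stage, o1_correct, o2_correct):
    """Return a 0/1 loss mask over [o_1 tokens] + [o_2 tokens].

    Stage 1: o_1 is always masked, so only o_2 (the correction) is trained.
    Stage 2: o_1 is unmasked only for samples judged non-conflicting.
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    # Assumption: signals conflict when o_1 and o_2 differ in correctness,
    # e.g. a wrong o_1 inside a rewarded wrong -> correct sample.
    non_conflicting = o1_correct == o2_correct
    train_o1 = stage == 2 and non_conflicting
    return [1 if train_o1 else 0] * len_o1 + [1] * len_o2


def masked_mean(per_token_loss, mask):
    """Average the per-token loss over unmasked positions only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / kept
```

For example, `response_mask(2, 3, 1, True, True)` masks both $o_1$ tokens, so a masked loss average depends only on the three $o_2$ tokens.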
