Learning Self-Correction in Vision-Language Models via Rollout Augmentation
Yi Ding, Ziliang Qiu, Bolian Li, Ruqi Zhang
TL;DR
This work tackles the sparsity of self-correction signals in vision-language model (VLM) reasoning by introducing Octopus, a rollout augmentation framework that recombines standard RL rollouts into dense, explicit self-correction demonstrations. A two-stage, response-masking training regime decouples self-correction from direct reasoning, which mitigates conflicting learning signals and lets both behaviors improve jointly. On seven benchmarks, Octopus-8B achieves state-of-the-art performance among open-source VLMs while requiring only $0.72\times$ the per-step training time of the best RLVR baseline. The results indicate that the contrastive signal already present within policy rollouts can be exploited to instill controllable self-correction in VLMs at reduced training cost.
Abstract
Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it: effective self-correction behaviors emerge only rarely, so the learning signal is extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation improves sample efficiency through rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves state-of-the-art (SoTA) performance among open-source VLMs, outperforming the best baseline trained with RL with verifiable rewards (RLVR) by 1.0 points while requiring only $0.72\times$ the training time per step.
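The recombination and masking ideas described above can be illustrated with a minimal sketch. This is one plausible reading of the abstract, not the authors' implementation: the pairing rule (every incorrect rollout followed by every correct one), the `CORRECTION_MARKER` string, the segment-level `loss_mask`, and all names (`Rollout`, `CorrectionSample`, `augment_rollouts`) are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple

# Hypothetical marker: the source does not say how the two attempts are joined.
CORRECTION_MARKER = "\nWait, let me re-check.\n"


@dataclass
class Rollout:
    text: str
    correct: bool  # outcome of a verifiable reward check


@dataclass
class CorrectionSample:
    segments: Tuple[str, str, str]   # (first attempt, marker, second attempt)
    loss_mask: Tuple[int, int, int]  # 1 = segment contributes to the loss


def augment_rollouts(
    rollouts: List[Rollout], mask_first_attempt: bool = True
) -> List[CorrectionSample]:
    """Pair each incorrect rollout with each correct one from the same prompt.

    Assumption: a synthetic self-correction example is "wrong attempt,
    marker, right attempt". With mask_first_attempt=True the first attempt
    is excluded from the loss, so only the correction is supervised.
    """
    wrong = [r for r in rollouts if not r.correct]
    right = [r for r in rollouts if r.correct]
    first = 0 if mask_first_attempt else 1
    return [
        CorrectionSample((w.text, CORRECTION_MARKER, r.text), (first, 1, 1))
        for w, r in product(wrong, right)
    ]


if __name__ == "__main__":
    group = [
        Rollout("x = 3, so the answer is 9.", False),
        Rollout("x = 4, so the answer is 16.", True),
        Rollout("x = 5, so the answer is 25.", False),
    ]
    for sample in augment_rollouts(group):
        print(sample.loss_mask, repr("".join(sample.segments)))
```

Under these assumptions, a rollout group with m incorrect and n correct responses yields m·n correction examples without any additional sampling, which is the sense in which reuse can make the signal denser; a group with no correct (or no incorrect) rollout yields none.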
