Self-Correcting VLA: Online Action Refinement via Sparse World Imagination

Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li, Guoli Yang, Heng Tao Shen

TL;DR

Self-Correcting VLA (SC-VLA) achieves self-improvement by intrinsically guiding action refinement through sparse imagination: auxiliary predictive heads forecast current task progress and future trajectory trends, constraining the policy to encode short-term physical evolution.

Abstract

Standard vision-language-action (VLA) models rely on fitting statistical data priors, which limits their understanding of the underlying physical dynamics. Reinforcement learning enhances physical grounding through exploration, yet it typically relies on external reward signals that remain isolated from the agent's internal states. World action models have emerged as a promising paradigm that integrates imagination and control to enable predictive planning. However, they rely on implicit context modeling and lack explicit mechanisms for self-improvement. To address these problems, we propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. We first design sparse world imagination by integrating auxiliary predictive heads that forecast current task progress and future trajectory trends, thereby constraining the policy to encode short-term physical evolution. We then introduce an online action refinement module that reshapes progress-dependent dense rewards, adjusting trajectory orientation based on the predicted sparse future states. Evaluations on challenging robot manipulation tasks from simulation benchmarks and real-world settings demonstrate that SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines, alongside a 14% gain in real-world experiments. Code is available at https://github.com/Kisaragi0/SC-VLA.
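To make the idea of a progress-dependent dense reward concrete, the following is a minimal sketch under stated assumptions, not the authors' formulation. It assumes the model exposes a predicted task progress in [0, 1] and a predicted future state change, and it blends two terms: how well the executed state change aligns with the predicted trend, and how much the predicted progress increased. The function names, the cosine-similarity alignment term, and the linear progress-based weighting are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def dense_reward(progress_prev, progress_now, delta_pred, delta_actual):
    """Hypothetical intrinsic reward built from two imagined quantities.

    progress_prev, progress_now: predicted task progress in [0, 1].
    delta_pred: predicted future state change (the imagined trend).
    delta_actual: state change actually observed after acting.
    """
    progress_gain = progress_now - progress_prev
    direction = cosine(delta_pred, delta_actual)
    # Assumed weighting: early in the task, favor following the imagined
    # trend; late in the task, favor measured progress gains.
    w = min(1.0, max(0.0, progress_now))
    return (1.0 - w) * direction + w * progress_gain
```

Under this assumed weighting, a step taken at zero progress that moves exactly along the imagined trend receives a reward of 1.0, while near task completion only further progress gain is rewarded.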

Paper Structure

This paper contains 40 sections, 15 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: We present Self-Correcting VLA (SC-VLA), a novel framework designed to enhance physical grounding through intrinsic self-improvement. The model is equipped with Sparse World Imagination (SPI) to forecast task progress and future trajectory trends, and Online Action Refinement (OAR) to dynamically optimize policies via residual adjustments and reshaped rewards. SC-VLA achieves superior performance on ManiSkill and real-world ARX5 benchmarks, surpassing baselines in both success rate and execution throughput.
  • Figure 2: The architecture of Self-Correcting VLA. The framework consists of two stages: Stage I (Top) utilizes a VLM and DiT-based backbone to generate base actions and sparse world imagination, decoded from the final output (Layer $N$) and intermediate features (Layer $M$), respectively. Stage II (Bottom) implements Online Action Refinement, where a Residual RL Module optimizes the final action by learning a residual term. This process is guided by endogenous dense rewards derived from the dynamic weighting of imagination consistency (Progress and $\Delta$State) without external supervision.
  • Figure 3: Hardware platforms and visualizations of sampled tasks. The setup is equipped with a wrist-mounted camera and a third-person camera.
  • Figure 4: Ablation study on the effectiveness of sparse world imagination rewards and dynamic weight scheduling. We visualize the performance curves starting from the main training phase, excluding the data collection and residual warm-up periods. See Appendix C for further details.
  • Figure 5: Schematic of the multi-stage training protocol and residual weight schedule. The training process is divided into three distinct phases to ensure stability: (1) Buffer Warm-up for gathering demonstrations, (2) Residual Injection for gradually introducing the RL policy, and (3) the Main Training Phase. The curve illustrates the evolution of the residual weight $\lambda$, transitioning from pure imitation to full residual learning.
  • ...and 2 more figures
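
The residual weight schedule described in the Figure 5 caption can be sketched as follows. This is an illustrative sketch under stated assumptions: the caption only says that $\lambda$ moves from pure imitation to full residual learning across three phases, so the linear ramp, the additive composition `base + lam * residual`, and the names `residual_weight`, `compose_action`, `warmup_steps`, and `ramp_steps` are hypothetical.

```python
def residual_weight(step, warmup_steps, ramp_steps):
    """Hypothetical schedule for the residual weight lambda.

    Phase 1 (buffer warm-up): lambda = 0, the base policy acts alone.
    Phase 2 (residual injection): lambda ramps linearly from 0 to 1.
    Phase 3 (main training): lambda = 1, full residual learning.
    """
    if step < warmup_steps:
        return 0.0
    if ramp_steps <= 0:
        return 1.0
    return min(1.0, (step - warmup_steps) / ramp_steps)


def compose_action(base_action, residual, lam):
    """Assumed composition: final action = base action + lambda * residual."""
    return [b + lam * r for b, r in zip(base_action, residual)]


# Usage: halfway through the injection phase the residual is half-weighted.
lam = residual_weight(step=200, warmup_steps=100, ramp_steps=200)
action = compose_action([1.0, 2.0], [0.5, -0.5], lam)
```

Keeping $\lambda$ at zero during warm-up means the replay buffer is filled by the base policy alone, which matches the caption's description of starting from pure imitation.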