
ProgressVLA: Progress-Guided Diffusion Policy for Vision-Language Robotic Manipulation

Hongyu Yan, Qiwei Li, Jiaolong Yang, Yadong Mu

Abstract

Most existing vision-language-action (VLA) models for robotic manipulation lack progress awareness, typically relying on hand-crafted heuristics for task termination. This limitation is particularly severe in long-horizon tasks involving cascaded sub-goals. In this work, we investigate the estimation and integration of task progress, proposing a novel model named ProgressVLA. Our technical contributions are twofold: (1) robust progress estimation: we pre-train a progress estimator on large-scale, unsupervised video-text robotic datasets. This estimator achieves a low prediction residual (0.07 on a scale of [0, 1]) in simulation and demonstrates zero-shot generalization to unseen real-world samples; and (2) differentiable progress guidance: we introduce an inverse dynamics world model that maps predicted action tokens into future latent visual states. These latents are then processed by the progress estimator; by applying a maximal progress regularization, we establish a differentiable pipeline that provides progress-piloted guidance to refine action tokens. Extensive experiments on the CALVIN and LIBERO benchmarks, alongside real-world robot deployment, consistently demonstrate substantial improvements in success rates and generalization over strong baselines.
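The differentiable guidance pipeline described above can be sketched as a classifier-guidance step: a world model rolls predicted action tokens into a future latent state, a progress estimator scores that latent, and the gradient of the progress score is backpropagated through the world model to nudge the actions toward higher task completion. The following is a minimal, hypothetical PyTorch sketch; the module shapes, names (`WorldModel`, `ProgressEstimator`, `progress_guided_refine`), and the single-step gradient-ascent update are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Toy stand-in: maps a chunk of action tokens to a future latent state."""
    def __init__(self, action_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim)
        )

    def forward(self, actions: torch.Tensor) -> torch.Tensor:
        return self.net(actions)

class ProgressEstimator(nn.Module):
    """Toy stand-in: scores a latent state with a completion value in [0, 1]."""
    def __init__(self, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(latent))

def progress_guided_refine(actions, world_model, estimator, scale=0.01):
    """One guidance step: backpropagate the progress score through the
    world model and move the action tokens up the progress gradient."""
    actions = actions.detach().requires_grad_(True)
    progress = estimator(world_model(actions)).sum()
    (grad,) = torch.autograd.grad(progress, actions)
    return (actions + scale * grad).detach()
```

In the paper's pipeline this refinement would be interleaved with the diffusion denoising steps (analogous to classifier guidance), with `scale` controlling how strongly progress maximization steers the sampled action chunk.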


Paper Structure

This paper contains 61 sections, 36 equations, 10 figures, and 5 tables.

Figures (10)

  • Figure 2: Overview of ProgressVLA. Conditioned on a language instruction and the current observation, the diffusion policy first generates a candidate chunk of latent actions. An action-oriented world model then rolls out these actions within a pre-trained visual feature space to project future states, while a progress estimator assigns a completion score to the predicted outcomes. Finally, progress gradients are backpropagated through the world model as classifier guidance, steering the diffusion process toward actions that maximize task advancement.
  • Figure 3: Architecture of the action-dynamics-oriented world model.
  • Figure 4: Reinforcement learning framework of ProgressVLA.
  • Figure 5: Illustration of the five tasks used for real-world deployment on an ARX dual-arm robot.
  • Figure 6: Real-world scenarios used to investigate the generalization of the progress estimator.
  • ...and 5 more figures