StageCraft: Execution Aware Mitigation of Distractor and Obstruction Failures in VLA Models

Kartikay Milind Pangaonkar, Prabin Rath, Omkar Patil, Nakul Gopalan

Abstract

Large-scale pretraining on text and image data, along with diverse robot demonstrations, has helped Vision Language Action models (VLAs) generalize to novel tasks, objects, and scenes. However, these models are still susceptible to failure in the presence of execution-time impediments such as distractors and physical obstructions in the robot's workspace. Existing policy improvement methods finetune base VLAs to improve generalization, yet they still struggle in unseen distractor settings. To address this problem, we investigate whether the internet-scale pretraining of large vision-language models (VLMs) can be leveraged to reason about these impediments and mitigate policy failures. To this end, we propose StageCraft, a training-free approach that improves pretrained VLA policy performance by manipulating the environment's initial state using VLM-based in-context reasoning. StageCraft takes policy rollout videos and success labels as input and leverages the VLM's reasoning ability to infer which objects in the initial state need to be manipulated to avoid anticipated execution failures. StageCraft is an extensible plug-and-play module that does not introduce additional constraints on the underlying policy and requires only a few policy rollouts to work. We evaluate the performance of state-of-the-art VLA models with StageCraft and show an absolute 40% performance improvement across three real-world task domains involving diverse distractors and obstructions. Our simulation experiments in RLBench empirically show that StageCraft tailors its extent of intervention to the strength of the underlying policy and that its performance improves with more in-context samples. Videos of StageCraft in action can be found at https://stagecraft-decorator.github.io/stagecraft/.
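
The loop described in the abstract (rollouts and success labels in, a list of objects to move out of the way, then policy execution) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Rollout`, `build_prompt`, `parse_objects`, and `stage_then_run` are hypothetical, the VLM, the pick-and-place primitive, and the policy are passed in as plain callables, and rollout videos are represented only by file paths rather than being sent to the model as video.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rollout:
    """One in-context example: a rollout video (by path) and its success label."""
    video_path: str
    success: bool


def build_prompt(task: str, rollouts: Sequence[Rollout],
                 scene_objects: Sequence[str]) -> str:
    """Assemble a text prompt from the task, labeled rollouts, and scene objects."""
    lines = [f"Task: {task}", "Past rollouts:"]
    for r in rollouts:
        label = "success" if r.success else "failure"
        lines.append(f"- {r.video_path}: {label}")
    lines.append("Objects in the initial state: " + ", ".join(scene_objects))
    lines.append("List the objects to remove before execution, one per line.")
    return "\n".join(lines)


def parse_objects(reply: str, scene_objects: Sequence[str]) -> List[str]:
    """Keep only reply lines that name an object actually present in the scene."""
    known = {o.lower(): o for o in scene_objects}
    picked: List[str] = []
    for line in reply.splitlines():
        name = line.strip().lstrip("-* ").strip().lower()
        if name in known and known[name] not in picked:
            picked.append(known[name])
    return picked


def stage_then_run(task: str,
                   rollouts: Sequence[Rollout],
                   scene_objects: Sequence[str],
                   query_vlm: Callable[[str], str],       # hypothetical VLM call
                   remove_object: Callable[[str], None],  # hypothetical primitive
                   run_policy: Callable[[str], bool]) -> bool:
    """Modify the initial state first, then execute the unchanged policy."""
    reply = query_vlm(build_prompt(task, rollouts, scene_objects))
    for obj in parse_objects(reply, scene_objects):
        remove_object(obj)
    return run_policy(task)
```

Passing the VLM, the removal primitive, and the policy in as callables mirrors the plug-and-play framing: in this sketch the policy is never modified, only the scene it starts from. Filtering the reply against the known scene objects is an added safeguard of the sketch, not something the abstract specifies.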

Paper Structure (17 sections, 2 equations, 8 figures)

Figures (8)

  • Figure 1: We empirically evaluate StageCraft with two state-of-the-art VLA models, Pi0.5 [intelligence2025pi] and SmolVLA [shukor2025smolvla], under challenging clutter and obstruction settings. In the setup shown, StageCraft identifies "pink pattern cloth" and "red white santa plush" as potential failure-inducing distractors for the stack_cups task. These objects are then detected and removed from the robot's workspace via primitive pick-and-place actions, after which the policy is executed and successfully completes the task.
  • Figure 2: Real-world task domains used to evaluate StageCraft. The stack_cups and setup_plate tasks require precise manipulation across four subtasks, whereas block_in_bowl is simpler and primarily requires accurate gripper alignment with the block for successful completion.
  • Figure 3: Set of $8$ distractor objects used in our real-world experiments. The gray collector bin is also a distractor as it was not present in the robot's workspace during data collection.
  • Figure 4: Task success rates for real-world robot experiments with Pi0.5 and SmolVLA under the Base, Distractor, and StageCraft settings. StageCraft recovers baseline performance, bringing success rates closer to those observed in the Base setting.
  • Figure 5: Comparison of three different VLMs on (a) prompt-following accuracy, (b) task success rate, and (c) number of corrective steps per rollout for the block_in_bowl task. Models with lower prompt-following accuracy fail to take the correct environment modification steps, resulting in lower policy success rates.
  • ...and 3 more figures