SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models

Meng Li, Zhen Zhao, Zhengping Che, Fei Liao, Kun Wu, Zhiyuan Xu, Pei Ren, Zhao Jin, Ning Liu, Jian Tang

TL;DR

SwitchVLA tackles dynamic task switching in Vision-Language-Action robots by introducing execution-aware conditioning through contact state and behavior modes, enabling smooth forward, rollback, and advance actions without external planners or extra demonstrations. The architecture comprises a VLC Embedding Module and a Conditional Execution Expert that jointly encode multimodal context and generate temporally coherent action chunks. Training uses behavior-specific supervision with forward/rollback/advance targets and diffusion-based flow-matching, while inference supports online re-planning as new instructions arrive. Extensive simulations and real-world robotic experiments demonstrate robust, instruction-adherent switching and improved generalization across multi-stage tasks, highlighting practical benefits for interactive, dynamic environments.

Abstract

Robots deployed in dynamic environments must not only follow diverse language instructions but also adapt flexibly when user intent changes mid-execution. While recent Vision-Language-Action (VLA) models have advanced multi-task learning and instruction following, they typically assume static task intent and fail to respond when new instructions arrive during ongoing execution. This limitation hinders natural and robust interaction in dynamic settings, such as retail or household environments, where real-time intent changes are common. We propose SwitchVLA, a unified, execution-aware framework that enables smooth and reactive task switching without external planners or additional switch-specific data. We model task switching as a behavior modulation problem conditioned on execution state and instruction context. Expert demonstrations are segmented into temporally grounded contact phases, allowing the policy to infer task progress and adjust its behavior accordingly. A multi-behavior conditional policy is then trained to generate flexible action chunks under varying behavior modes through conditioned trajectory modeling. Experiments in both simulation and real-world robotic manipulation demonstrate that SwitchVLA enables robust instruction adherence, fluid task switching, and strong generalization, outperforming prior VLA baselines in both task success rate and interaction naturalness.
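
To make the idea of segmenting a demonstration into contact phases concrete, the following is a minimal sketch, not the authors' pipeline. It assumes each frame of a trajectory already carries a boolean contact label (in the paper these labels come from a pre-trained VLM; here they are given as a plain list), and the function name `contact_intervals` is hypothetical. It collapses the per-frame labels into inclusive (start, end) frame intervals.

```python
from typing import List, Tuple


def contact_intervals(contact: List[bool]) -> List[Tuple[int, int]]:
    """Collapse per-frame contact flags into inclusive (start, end) intervals.

    `contact[t]` is assumed to be True when the gripper touches the object
    at frame t. How these flags are obtained is outside this sketch.
    """
    intervals: List[Tuple[int, int]] = []
    start = None  # index where the current contact run began, if any
    for t, in_contact in enumerate(contact):
        if in_contact and start is None:
            start = t  # a contact phase begins
        elif not in_contact and start is not None:
            intervals.append((start, t - 1))  # the phase ended at the previous frame
            start = None
    if start is not None:
        # the trajectory ended while still in contact
        intervals.append((start, len(contact) - 1))
    return intervals


# Hypothetical 8-frame trajectory with two contact phases.
flags = [False, False, True, True, True, False, True, True]
print(contact_intervals(flags))  # [(2, 4), (6, 7)]
```

Frames outside the returned intervals correspond to non-contact phases, so a policy conditioned on such a signal can tell, for example, whether an object is currently held when a new instruction arrives.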

Paper Structure

This paper contains 34 sections, 7 figures, 12 tables, 1 algorithm.

Figures (7)

  • Figure 1: (a) Processes 1 and 2 show normal task execution. When the user changes their mind (e.g., "pick up lemon place on plate"), conventional VLA models cannot adjust their plan, leading to erratic behavior such as oscillation or dropping items, as seen in Process 3. (b) A more natural response involves first returning the previously held item (e.g., placing down the cookie box in Process 3), then picking up the lemon and placing it on the plate in Processes 4 and 5.
  • Figure 2: Overview of SwitchVLA. The framework consists of the Vision-Language-Contact Embedding module and the Conditional Execution Expert, which jointly fuse multimodal inputs to generate execution-aware and conditionally controlled actions.
  • Figure 3: Identify and label time intervals of a specified event from trajectory data using a pre-trained VLM, such as GPT-4o. For example, with the prompt "Robot (gripper) in contact with object", the model retrieves and labels the contact time intervals within the trajectory.
  • Figure 4: Illustration of the training pipelines for forward, rollback, and advance behaviors. Dynamic task transitions are achieved through policy modulation based on the predicted behavior mode, allowing the system to adapt to changing instructions and feedback.
  • Figure 5: Top: Performance of $\pi_0$ (Black et al., 2024) under pairwise task switching. (a), (b), and (c) each illustrate a unique task transition. Sudden switches during execution lead to erratic behaviors. Bottom: SwitchVLA enables smooth, consistent, and instruction-aligned task transitions.
  • ...and 2 more figures