Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust

Asher J. Hancock, Allen Z. Ren, Anirudha Majumdar

TL;DR

Bring Your Own VLA (BYOVLA) is introduced: a run-time intervention scheme that dynamically identifies regions of the input image that the model is sensitive to, and minimally alters task-irrelevant regions to reduce the model's sensitivity using automated image editing tools.

Abstract

Vision-language-action (VLA) models trained on large-scale internet data and robot demonstrations have the potential to serve as generalist robot policies. However, despite their large-scale training, VLAs are often brittle to task-irrelevant visual details such as distractor objects or background colors. We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme that (1) dynamically identifies regions of the input image that the model is sensitive to, and (2) minimally alters task-irrelevant regions to reduce the model's sensitivity using automated image editing tools. Our approach is compatible with any off-the-shelf VLA and requires neither model fine-tuning nor access to the model's weights. Hardware experiments on language-instructed manipulation tasks demonstrate that BYOVLA enables state-of-the-art VLA models to nearly retain their nominal performance in the presence of distractor objects and backgrounds, which otherwise degrade task success rates by up to 40%. Website with additional information, videos, and code: https://aasherh.github.io/byovla/ .
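The two steps in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how sensitivity is measured or how regions are edited, so the perturb-and-compare sensitivity score, the constant-fill stand-in for inpainting, and every name below (`perturb`, `sensitivity`, `intervene`, `tau`) are assumptions made for illustration. The only property taken from the source is that the policy is treated as a black box, with no access to weights.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Image = List[List[float]]                    # toy grayscale image, values in [0, 1]
Region = List[Tuple[int, int]]               # (row, col) pixels of one region
Policy = Callable[[Image], Sequence[float]]  # black-box stand-in for a VLA


def perturb(image: Image, region: Region, rng: random.Random) -> Image:
    """Return a copy of the image with noise added inside the region only."""
    out = [row[:] for row in image]
    for r, c in region:
        out[r][c] = min(1.0, max(0.0, out[r][c] + rng.uniform(-0.5, 0.5)))
    return out


def sensitivity(policy: Policy, image: Image, region: Region,
                n_samples: int = 8, seed: int = 0) -> float:
    """Assumed measure: mean L2 change in the action under region perturbations."""
    rng = random.Random(seed)
    base = policy(image)
    total = 0.0
    for _ in range(n_samples):
        action = policy(perturb(image, region, rng))
        total += sum((a - b) ** 2 for a, b in zip(action, base)) ** 0.5
    return total / n_samples


def intervene(policy: Policy, image: Image,
              irrelevant_regions: Dict[str, Region],
              tau: float, fill: float = 0.0) -> Tuple[Image, List[str]]:
    """Edit only task-irrelevant regions whose sensitivity exceeds tau."""
    edited = [row[:] for row in image]
    changed = []
    for name, region in irrelevant_regions.items():
        if sensitivity(policy, image, region) > tau:
            for r, c in region:
                edited[r][c] = fill  # constant fill stands in for inpainting
            changed.append(name)
    return edited, changed


# Toy usage: the policy reads pixel (0, 0), a distractor, and pixel (1, 1),
# which is task-relevant and therefore never offered for editing.
toy_image = [[0.5, 0.5], [0.5, 0.5]]
toy_policy = lambda img: [img[0][0], img[1][1]]
toy_regions = {"distractor": [(0, 0)], "background": [(0, 1)]}
edited_image, changed_regions = intervene(toy_policy, toy_image, toy_regions, tau=0.05)
```

In this toy setup only the distractor region is edited: the policy ignores the background pixel, so its sensitivity is zero and it is left untouched, which mirrors the idea of altering the image as little as possible.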

Paper Structure (19 sections, 3 equations, 8 figures, 1 algorithm)

Figures (8)

  • Figure 1: We introduce BYOVLA: a simple and lightweight run-time intervention scheme for improving the performance of an arbitrary VLA model in the presence of task-irrelevant distractions. Our method identifies task-irrelevant regions in the visual observation and minimally modifies regions that the model is sensitive to in order to reduce sensitivity to distractors.
  • Figure 2: First row: task success rates for BYOVLA with Octo on language instruction "place the carrot on yellow plate." Second row: kitchenette environment from BridgeV2 dataset with and without object and background distractions.
  • Figure 3: First column: heatmaps showing the regions each method deems the VLA sensitive to. Second column: inpainted image regions with sensitivity threshold $\tau$. BYOVLA inpaints the blue towel, orange, and donut, and then successfully grasps the carrot and puts it on the plate (last two columns), while BYOVLA$\setminus$Sens. additionally inpaints the green knife and cup, but fails the task. GradCAM fails to capture the model's sensitivity to most irrelevant objects and thus also fails.
  • Figure 4: First column: task success rates for BYOVLA with OpenVLA on language instruction "put the eggplant in the pot." Second column: kitchenette environment from BridgeV2 dataset with distractions.
  • Figure 5: Left: GradCAM output at layer 6 with Octo on language instruction "place the carrot on yellow plate." Right: mask used to select what objects were attended to by keeping the top quarter of GradCAM values.
  • ...and 3 more figures
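
The Figure 5 caption describes building a mask by keeping the top quarter of GradCAM values. A minimal sketch of that thresholding step follows, assuming the heatmap is a plain 2D grid of scores. The function name `top_quarter_mask` and the handling of ties (tied values at the cutoff are all kept) are assumptions, since the caption does not specify them.

```python
from typing import List


def top_quarter_mask(heatmap: List[List[float]]) -> List[List[bool]]:
    """Mark cells whose value is at or above the 75th-percentile element."""
    values = sorted(v for row in heatmap for v in row)
    cutoff = values[(3 * len(values)) // 4]  # first element of the top quarter
    return [[v >= cutoff for v in row] for row in heatmap]


# Toy 2x4 heatmap: the two largest of the eight values are kept.
toy_heatmap = [[0.1, 0.8, 0.3, 0.4],
               [0.5, 0.6, 0.7, 0.2]]
toy_mask = top_quarter_mask(toy_heatmap)
```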