When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs

Yu Fang, Yuchun Feng, Dong Jing, Jiaqi Liu, Yue Yang, Zhenyu Wei, Daniel Szafir, Mingyu Ding

TL;DR

Vision-Language-Action models often fail to follow language faithfully because of vision-driven shortcuts. The authors introduce LIBERO-CF, a counterfactual benchmark, and Counterfactual Action Guidance (CAG), a plug-in dual-branch inference scheme that combines a language-conditioned VLA with a vision-only (language-unconditioned) VA branch to strengthen the influence of language at inference time. In simulation and real-world experiments, LIBERO-CF reveals pervasive counterfactual failures, and CAG consistently improves language grounding and task success on under-observed and out-of-distribution (OOD) tasks without modifying model architectures. The results indicate that CAG reduces biased, vision-driven behavior while preserving in-domain performance, suggesting a practical path to more robust, instruction-faithful VLAs in robotics.

Abstract

Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act based on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of language intent. To study this problem systematically, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs, which evaluates language-following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate its plug-and-play integration across diverse VLAs and consistent improvements. For example, on LIBERO-CF, CAG improves $π_{0.5}$ by 9.7% in language-following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.

Paper Structure

This paper contains 32 sections, 13 equations, 13 figures, and 7 tables.

Figures (13)

  • Figure 1: Overview. (a) Vision-Language-Action Models (VLAs) often suffer from counterfactual failures due to vision shortcuts, defaulting to well-learned scene-specific behaviors instead of faithfully following instructions. (b) We study this issue and introduce LIBERO-CF, the first counterfactual benchmark for evaluating language following in VLAs. (c) We propose Counterfactual Action Guidance (CAG), a dual-branch inference scheme that mitigates counterfactual failures in VLAs. (d) Extensive experiments in both simulation and the real world demonstrate the effectiveness of CAG across diverse VLAs.
  • Figure 2: Evidence for Vision Shortcuts in VLAs. (a) We visualize the distribution of grasp positions from 50 trials as heatmaps under different instructions. Even when given counterfactual or empty instructions, VLAs tend to execute the well-learned training task in the scene. (b) Removing the training-task object in the scene improves the success rates of VLAs on counterfactual instructions.
  • Figure 3: Method. We propose Counterfactual Action Guidance (CAG), a dual-branch inference scheme that enhances language conditioning by combining a VLA policy with a language-unconditioned Vision-Action (VA) branch.
  • Figure 4: Investigation of guidance scale. Increasing the guidance scale strengthens language conditioning and improves grounding accuracy. However, overly large scales degrade task success due to over-guidance.
  • Figure 5: Real-world experiments. We study multiple aspects of language grounding in real-world evaluations, including object recognition, spatial reasoning, goal execution, and out-of-distribution generalization. CAG consistently reduces counterfactual failures and improves task success across all scenes.
  • ...and 8 more figures
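
The captions for Figures 3 and 4 describe CAG as combining a language-conditioned VLA branch with a language-unconditioned VA branch, with a guidance scale controlling how strongly language is emphasized. The exact combination rule is not given in this summary, so the following is only a minimal sketch under a stated assumption: that guidance extrapolates linearly from the VA action toward the VLA action, in the style of classifier-free guidance. The names `guided_action`, `cag_step`, `vla_policy`, `va_policy`, and the default `scale` are hypothetical, and the sketch operates on final action vectors rather than on whatever internal quantities a particular VLA exposes.

```python
from typing import Callable, List, Sequence

Action = List[float]


def guided_action(vla_action: Sequence[float],
                  va_action: Sequence[float],
                  scale: float) -> Action:
    """Combine two branch outputs with a guidance scale.

    ASSUMPTION: linear extrapolation a = a_va + scale * (a_vla - a_va).
    scale = 0 gives the vision-only action, scale = 1 gives the VLA action
    unchanged, and scale > 1 amplifies the language-dependent difference.
    """
    if len(vla_action) != len(va_action):
        raise ValueError("action vectors must have the same length")
    return [va + scale * (vla - va) for vla, va in zip(vla_action, va_action)]


def cag_step(vla_policy: Callable[[object, str], Sequence[float]],
             va_policy: Callable[[object], Sequence[float]],
             observation: object,
             instruction: str,
             scale: float = 1.5) -> Action:
    """One inference step: query both branches, then apply guidance."""
    a_vla = vla_policy(observation, instruction)  # language-conditioned branch
    a_va = va_policy(observation)                 # language-unconditioned branch
    return guided_action(a_vla, a_va, scale)


# Toy stand-ins for the two branches (hypothetical, for illustration only).
def toy_vla(observation: object, instruction: str) -> Action:
    return [1.0, 2.0]


def toy_va(observation: object) -> Action:
    return [0.0, 2.0]


print(cag_step(toy_vla, toy_va, observation=None, instruction="pick up the bowl", scale=2.0))
```

In this toy example only the first action dimension differs between the branches, so only that dimension is amplified. Under the same assumption, a very large `scale` pushes the action far from either branch's output, which is in line with the over-guidance effect described in the Figure 4 caption.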