Action Hallucination in Generative Visual-Language-Action Models

Harold Soh, Eugene Lim

TL;DR

Focusing on latent-variable generative policies, the paper shows that action hallucinations often arise from structural mismatches between feasible robot behavior and common model architectures. It studies three such barriers -- topological, precision, and horizon -- and shows how they impose unavoidable tradeoffs.

Abstract

Robot Foundation Models such as Vision-Language-Action models are rapidly reshaping how robot policies are trained and deployed, replacing hand-designed planners with end-to-end generative action models. While these systems demonstrate impressive generalization, it remains unclear whether they fundamentally resolve the long-standing challenges of robotics. We address this question by analyzing action hallucinations that violate physical constraints and their extension to plan-level failures. Focusing on latent-variable generative policies, we show that hallucinations often arise from structural mismatches between feasible robot behavior and common model architectures. We study three such barriers -- topological, precision, and horizon -- and show how they impose unavoidable tradeoffs. Our analysis provides mechanistic explanations for reported empirical failures of generative robot policies and suggests principled directions for improving reliability and trustworthiness, without abandoning their expressive power.

Paper Structure (35 sections, 19 theorems, 101 equations, 3 figures)

Key Result

Lemma 9

Suppose Assumption ass:disconnected holds and there exist $z_L,z_R \in \mathcal{Z}$ such that $\pi_\theta(s,z_L) \in U_L$ and $\pi_\theta(s,z_R) \in U_R$. Define the seam set $\mathcal{Z}_{\mathrm{seam}}(s) := \{ z \in \mathcal{Z} : \pi_\theta(s,z) \in \mathcal{A}_{\mathrm{forb}}(s)\}$. Then $\mathcal{Z}_{\mathrm{seam}}(s)$ is non-empty and open (cf. Figure 2b), which implies a non-zero hallucination probability. In other words, no continuous latent-to-action map that covers both safe modes can be hallucination-free.
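
The seam argument can be illustrated with a one-dimensional toy example. This is a minimal sketch under assumptions that are not taken from the paper: the latent is $z \sim \mathcal{N}(0,1)$, the policy is the hypothetical map `policy(z) = tanh(k*z)`, the safe modes are $U_L = (-1,-w]$ and $U_R = [w,1)$, and the forbidden region is $(-w,w)$ with gap width $W = 2w$. Because `tanh(k*z)` is continuous and reaches both modes, the intermediate value theorem forces some latents to decode into the gap. The map is $k$-Lipschitz, so the seam has latent length at least $W/L$ with $L = k$, in the spirit of the $W/L$ scaling mentioned in the Figure 2 caption.

```python
import math
import random


def policy(z: float, k: float) -> float:
    """Hypothetical continuous latent-to-action map with values in (-1, 1)."""
    return math.tanh(k * z)


def seam_length(k: float, w: float) -> float:
    """Length of the latent seam {z : |tanh(k z)| < w}."""
    return 2.0 * math.atanh(w) / k


def seam_probability(k: float, w: float) -> float:
    """Exact probability that z ~ N(0, 1) lands in the seam."""
    half_width = math.atanh(w) / k
    return math.erf(half_width / math.sqrt(2.0))


def monte_carlo_rate(k: float, w: float, n: int = 100_000, seed: int = 0) -> float:
    """Empirical fraction of sampled latents that decode to forbidden actions."""
    rng = random.Random(seed)
    hits = sum(abs(policy(rng.gauss(0.0, 1.0), k)) < w for _ in range(n))
    return hits / n


if __name__ == "__main__":
    w = 0.2  # half-width of the forbidden gap (assumed value)
    for k in (2.0, 5.0, 20.0, 50.0):
        exact = seam_probability(k, w)
        lower = 2.0 * w / k  # W / L with W = 2w and L = k
        print(f"k={k:5.1f}  seam length={seam_length(k, w):.4f}  "
              f"W/L={lower:.4f}  P(seam)={exact:.4f}")
```

Increasing `k` makes the map sharper (larger Lipschitz constant) and shrinks the seam, but the seam probability never reaches zero for any finite `k`.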

Figures (3)

  • Figure 1: (Left) The prototypical generative VLA analyzed in this work. Given state observations, a task prompt, and a noise sample, the model outputs robot actions. Recent VLAs are structured into a high-level planner and a low-level action head, but part of our theory also applies to those that do not have this explicit structure (e.g., Diffusion Policy [chi2023diffusionpolicy], RDT [liu2025rdtb]). (Right) An example where a robot is given a long-horizon task that involves multi-modality and precision.
  • Figure 2: Topological barrier for latent-variable VLA policies. (a) We study generative VLAs whose action head is a conditional latent-variable policy $\pi_\theta(s,z)$ that maps a state (e.g., an image--language context) and latent noise $z$ to a continuous action (or action chunk). In the illustrated navigation example, reaching the microwave requires going left or right around the kitchen island, inducing two qualitatively distinct modes of valid behavior. (b) Schematic of the topological barrier (Lemma 9). The valid actions (bottom panel) decompose into disconnected components $U_1$ and $U_2$ (e.g., left vs. right), separated by a forbidden region $\mathcal{A}_{\mathrm{forb}}$. If $\pi_\theta(s,\cdot)$ is continuous and maps $z\mapsto a\in U_1$ and $z'\mapsto a'\in U_2$, then any continuous latent path between $z$ and $z'$ must cross an open seam $\mathcal{Z}_{\mathrm{seam}}=\pi_\theta(s,\cdot)^{-1}(\mathcal{A}_{\mathrm{forb}})$, implying non-zero hallucination probability. Our lower bound scales with the number of modes and with $W/L$ (a gap-smoothness ratio). (c) Diffusion model trained on bimodal action data: red points in $\mathcal{Z}$ (top) lie on the seam and decode to forbidden actions in $\mathcal{A}$ (bottom). See Appendix \ref{app:band-topology-exp} for details. (d) Empirical trends for flow matching and diffusion. The hallucination rate $H$ increases approximately linearly with the number of modes $M$ (top) and grows with $W/\widehat{L}$ (bottom), consistent with Theorem 11 ($\widehat{L}$ is the numerically estimated Lipschitz constant).
  • Figure 3: Precision barrier for contact-rich tasks. (a) Many manipulation tasks (e.g., grasping, peg-in-hole, handling tools / articulated / deformable objects) require high precision in that valid actions concentrate near a lower-dimensional feasible set. We model this as a $k$-dimensional manifold $\mathcal{M}\subset\mathcal{A}$ with tolerance tube $\mathcal{M}_\delta=\{a:\mathrm{dist}(a,\mathcal{M})\le \delta\}$ (schematic). (b) Empirical distribution of distances $r=\mathrm{dist}(a,\mathcal{M})$ for samples from flow matching and diffusion (log scales). The shaded region $r>\delta$ corresponds to action hallucinations. See Appendix \ref{app:prec_exps} for experiment details. (c) Action hallucination rate $H(s;\delta)=\Pr[r>\delta]$ versus tolerance $\delta$ (log--log). Tightening the tolerance sharply increases hallucinations, consistent with our precision barrier (Lemma \ref{lem:density-barrier}), which shows that maintaining low hallucination at small $\delta$ requires increasingly concentrated mass near $\mathcal{M}$. (d) The geometric mean of per-step minimum singular values, $(\prod_{t=1}^{K}\sigma_{\min}(J\Phi_t))^{1/K}$, increases toward $1$ as the number of sampler steps $K$ grows, indicating that the necessary overall contraction can be distributed across many mild refinement steps rather than a single severe collapse (Theorem \ref{thm:precision-trilemma} and Corollary \ref{cor:k-step-tradeoff}).
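
The quantity $H(s;\delta)=\Pr[r>\delta]$ from Figure 3(c) can be estimated directly from samples. The sketch below is illustrative only and does not reproduce the paper's experiments: it assumes a hypothetical manifold $\mathcal{M}$ given by the unit circle in $\mathbb{R}^2$ (so $k=1$) and a stand-in "sampler" that places actions on the circle with Gaussian radial error of standard deviation `sigma`. All function names and parameter values are assumptions made for the example.

```python
import math
import random


def sample_actions(n: int, sigma: float, seed: int = 0) -> list[tuple[float, float]]:
    """Draw n toy actions near the unit circle with Gaussian radial error."""
    rng = random.Random(seed)
    actions = []
    for _ in range(n):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        radius = 1.0 + sigma * rng.gauss(0.0, 1.0)
        actions.append((radius * math.cos(theta), radius * math.sin(theta)))
    return actions


def dist_to_manifold(action: tuple[float, float]) -> float:
    """Distance r = dist(a, M) from an action to the unit circle."""
    return abs(math.hypot(action[0], action[1]) - 1.0)


def hallucination_rate(actions: list[tuple[float, float]], delta: float) -> float:
    """Empirical H(delta): fraction of actions outside the tolerance tube."""
    outside = sum(dist_to_manifold(a) > delta for a in actions)
    return outside / len(actions)


if __name__ == "__main__":
    sigma = 0.05  # assumed sampler error scale
    actions = sample_actions(50_000, sigma)
    for delta in (0.2, 0.1, 0.05, 0.02, 0.01):
        rate = hallucination_rate(actions, delta)
        print(f"delta={delta:.2f}  H={rate:.4f}")
```

With the sampler's error scale held fixed, shrinking `delta` raises the estimated rate. Keeping $H$ low at a tighter tolerance therefore requires a smaller `sigma`, that is, more mass concentrated near $\mathcal{M}$.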

Theorems & Definitions (40)

  • Definition 1: Environment
  • Definition 2: Goal-Reaching Task Instance
  • Definition 4: Latent-Head Policy
  • Definition 5: Closed-Loop Rollout and Induced Plan
  • Definition 6: Physical Validity Oracle
  • Definition 7: Action Hallucination
  • Definition 8: Plan Hallucination
  • Lemma 9: Topological Barrier
  • Theorem 11: Isoperimetric lower bound on action hallucination
  • Definition 12: Contact/Precision Manifold and Tolerance
  • ...and 30 more