Altered Thoughts, Altered Actions: Probing Chain-of-Thought Vulnerabilities in VLA Robotic Manipulation

Tuan Duong Trinh, Naveed Akhtar, Basim Azam

Abstract

Recent Vision-Language-Action (VLA) models increasingly adopt chain-of-thought (CoT) reasoning, generating a natural-language plan before decoding motor commands. This internal text channel between the reasoning module and the action decoder has received no adversarial scrutiny. We ask: which properties of this intermediate plan does the action decoder actually rely on, and can targeted corruption of the reasoning trace alone, with all inputs left intact, degrade a robot's physical task performance? We design a taxonomy of seven text corruptions organized into three attacker tiers (blind noise, mechanical-semantic, and LLM-adaptive) and apply them to a state-of-the-art reasoning VLA across 40 LIBERO tabletop manipulation tasks. Our results reveal a striking asymmetry: substituting object names in the reasoning trace reduces overall success rate by 8.3 percentage points (pp), with drops of 19.3 pp on goal-conditioned tasks and as much as 45 pp on individual tasks, whereas sentence reordering, spatial-direction reversal, token noise, and even a 70B-parameter LLM crafting plausible-but-wrong plans all have negligible impact (within $\pm 4$ pp). This asymmetry indicates that the action decoder depends on entity-reference integrity rather than reasoning quality or sequential structure. Notably, a sophisticated LLM-based attacker underperforms simple mechanical object-name substitution, because preserving plausibility inadvertently retains the entity-grounding structure the decoder needs. A cross-architecture control using a non-reasoning VLA confirms the vulnerability is exclusive to reasoning-augmented models, while instruction-level attacks degrade both architectures. Together, these results establish the internal reasoning trace as a distinct and stealthy threat vector that is invisible to input-validation defenses.

Paper Structure (21 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Overview of reasoning trace attacks on robotic manipulation VLAs. (a) An adversary intercepts the chain-of-thought text between the reasoning module and action decoder; all visual inputs and task instructions remain clean. Corruptions are organized into three tiers of increasing attacker capability: Tier 1 (noise) = random_tokens, padding; Tier 2 (mechanical-semantic) = shuffled, entity_swap, negation_flip; Tier 3 (LLM-adaptive) = llm_adversarial. (b) Key finding: among seven corruption conditions evaluated across 40 LIBERO tabletop manipulation tasks, only entity-reference swapping causes significant degradation ($-8.3$ pp overall), while all others, including an LLM-adversarial rewrite, are negligible.
  • Figure 2: Corruption-to-failure heatmap on DeepThinkVLA. Each cell shows the absolute success rate (SR) and its degradation from the 95.4% clean baseline. Only entity_swap (red row) causes substantial damage; all other conditions, including LLM-adversarial (Tier 3), are negligible. Mean across 3 seeds, 600 episodes per cell.
  • Figure 3: Dose-response curves for graded random token replacement. LIBERO-Goal shows clear monotonic degradation ($-16.5\,\text{pp}$ at 100%, Spearman $\rho = -0.95$, $p < 0.0001$); other suites are largely robust.
  • Figure 4: Cross-surface comparison on DeepThinkVLA: CoT corruption (blue) vs. instruction attack (red). Bars show SR change in pp from the clean baseline. Instruction-level entity swap is dramatically more potent ($-85\,\text{pp}$ on Goal), but CoT attacks leave all inputs clean, making them invisible to input-validation defenses.
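
To make the corruption conditions named in Figure 1 concrete, the following is a minimal Python sketch of three of them (entity_swap, shuffled, and graded random_tokens) applied to a plain-text reasoning trace. It is an illustration under stated assumptions, not the authors' implementation: the function signatures, the example trace, the swap table, and the replacement vocabulary are all hypothetical, and the paper's actual tokenization and entity-matching rules are not given here.

```python
import random
import re


def entity_swap(trace: str, swaps: dict[str, str]) -> str:
    """Replace object names using a swap table (keys must be lowercase).

    A single regex pass is used so that a two-way swap such as
    bowl -> plate and plate -> bowl does not undo itself.
    """
    if not swaps:
        return trace
    names = sorted(swaps, key=len, reverse=True)  # prefer longer names first
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: swaps[m.group(0).lower()], trace)


def shuffle_sentences(trace: str, rng: random.Random) -> str:
    """Reorder the sentences of the trace while leaving each sentence intact."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", trace.strip()) if s]
    rng.shuffle(sentences)
    return " ".join(sentences)


def random_tokens(trace: str, fraction: float, vocab: list[str],
                  rng: random.Random) -> str:
    """Replace a given fraction of whitespace-separated tokens with random words."""
    tokens = trace.split()
    n_replace = round(fraction * len(tokens))
    for i in rng.sample(range(len(tokens)), n_replace):
        tokens[i] = rng.choice(vocab)
    return " ".join(tokens)


# Hypothetical reasoning trace and parameters, for illustration only.
trace = "Locate the bowl. Pick up the bowl. Place it on the plate."
rng = random.Random(0)

swapped = entity_swap(trace, {"bowl": "plate", "plate": "bowl"})
shuffled = shuffle_sentences(trace, rng)
noisy = random_tokens(trace, 0.5, ["cup", "left", "move", "drawer"], rng)

print(swapped)  # Locate the plate. Pick up the plate. Place it on the bowl.
print(shuffled)
print(noisy)
```

In this sketch the `fraction` argument of `random_tokens` plays the role of the corruption dose in Figure 3: sweeping it from 0 to 1 and recording the success rate at each level would yield a dose-response curve of the kind shown there.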