Think Proprioceptively: Embodied Visual Reasoning for VLA Manipulation

Fangyuan Wang, Peng Zhou, Jiaming Qi, Shipeng Lyu, David Navarro-Alarcon, Guodong Guo

TL;DR

This work introduces ThinkProprio, a proprioception-grounded VLA policy that tokenizes robot state into the VLM embedding space and fuses it with the instruction for early visual reasoning. By guiding visual token selection with both instruction and discretized proprioception, ThinkProprio achieves state-aware feature retention, enabling aggressive token reduction without sacrificing task performance. Empirical results on CALVIN and LIBERO show competitive or superior task success, especially in long-horizon settings, while delivering substantial latency and compute savings. The approach demonstrates that proprioceptive grounding can meaningfully shape perceptual reasoning in multimodal robotics, with practical benefits for real-time control and efficiency.

Abstract

Vision-language-action (VLA) models typically inject proprioception only as a late conditioning signal, which prevents robot state from shaping instruction understanding and from influencing which visual tokens are attended to throughout the policy. We introduce ThinkProprio, which converts proprioception into a sequence of text tokens in the VLM embedding space and fuses them with the task instruction at the input. This early fusion lets embodied state participate in subsequent visual reasoning and token selection, biasing computation toward action-critical evidence while suppressing redundant visual tokens. In a systematic ablation over proprioception encoding, state entry point, and action-head conditioning, we find that text tokenization is more effective than learned projectors, and that retaining roughly 15% of visual tokens can match the performance of using the full token set. Across CALVIN, LIBERO, and real-world manipulation, ThinkProprio matches or improves over strong baselines while reducing end-to-end inference latency by over 50%.
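
To make the idea of text-tokenized proprioception concrete, the following is a minimal sketch, not the authors' implementation. It assumes a uniform binning scheme with 256 bins, hypothetical per-dimension state bounds, and a hypothetical prompt layout; the names `discretize_state` and `build_prompt` are illustrative and do not come from the paper. The resulting string would then pass through the VLM's ordinary text tokenizer together with the instruction.

```python
from typing import List, Sequence, Tuple


def discretize_state(
    state: Sequence[float],
    bounds: Sequence[Tuple[float, float]],
    num_bins: int = 256,  # assumed bin count, not taken from the paper
) -> List[int]:
    """Map each continuous state dimension to an integer bin index."""
    if len(state) != len(bounds):
        raise ValueError("state and bounds must have the same length")
    bins = []
    for value, (low, high) in zip(state, bounds):
        clipped = min(max(value, low), high)   # clamp to the valid range
        frac = (clipped - low) / (high - low)  # normalize to [0, 1]
        bins.append(min(int(frac * num_bins), num_bins - 1))
    return bins


def build_prompt(
    instruction: str,
    state: Sequence[float],
    bounds: Sequence[Tuple[float, float]],
    num_bins: int = 256,
) -> str:
    """Fuse the instruction with the discretized state as plain text.

    The " | state: " separator is a hypothetical layout chosen for readability.
    """
    state_text = " ".join(str(b) for b in discretize_state(state, bounds, num_bins))
    return f"{instruction} | state: {state_text}"
```

For example, `build_prompt("push the block", [0.0, 1.0], [(0.0, 1.0), (0.0, 1.0)])` yields `"push the block | state: 0 255"`. Because the state arrives as ordinary text, it needs no separate learned projector, which is the contrast the ablation in the abstract draws.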

Paper Structure (38 sections, 10 equations, 8 figures, 15 tables)

Figures (8)

  • Figure 1: ThinkProprio tokenizes proprioception into the VLM space to guide early visual reasoning. This yields strong CALVIN/LIBERO performance with 15% of visual tokens and 58% lower latency than prior VLA policies.
  • Figure 2: Overview of ThinkProprio. Proprioception is text-tokenized and combined with the instruction to guide visual token selection via cross-attention, retaining only task-relevant patches alongside a global context token. The compact token set is processed by the VLM, and the action head attends to the resulting features through cross-attention.
  • Figure 3: Token retention across four timesteps for two tasks, shown with paired static and wrist-mounted gripper views. Heatmaps visualize token retention scores, and labels indicate whether the retained tokens primarily focus on objects, proprioception, or both. The overlay Sel in each frame reports the number of retained tokens out of available visual tokens at that timestep.
  • Figure 4: Recovery behavior on a challenging stacking task. The policy requires more than 300 steps to stack the pink block on the smaller red support, and the heatmaps show persistent attention on the object and gripper as it repeatedly corrects the placement.
  • Figure 6: Failure case analysis on Push Pink Block Right. We show sequential timesteps from a representative failed rollout, with paired views from the static camera and the wrist-mounted gripper camera. Heatmaps overlay the selector's token-retention scores (higher intensity indicates higher retention priority), illustrating where the model attends when deciding which visual evidence to preserve for action prediction. Despite repeated approaches, the end-effector does not maintain sustained contact with the pink block, leading to insufficient rightward displacement within the subtask horizon.
  • ...and 3 more figures
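
The selection step described in the Figure 2 caption (retaining task-relevant patches alongside a global context token) can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the paper's selector: it assumes a single pooled query vector standing in for the fused instruction and proprioception tokens, scaled dot-product scoring, a fixed keep ratio with hard top-k, and a mean-pooled global token. The name `select_tokens` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def select_tokens(
    query: Sequence[float],
    patches: Sequence[Sequence[float]],
    keep_ratio: float = 0.15,  # roughly the retention level quoted in the abstract
) -> Tuple[List[int], List[List[float]]]:
    """Score visual patches against a query and keep the top fraction.

    Returns the kept patch indices (in original order) and the compact token
    set, with an assumed mean-pooled global context token placed first.
    """
    if not patches:
        raise ValueError("patches must be non-empty")
    dim = len(query)
    scale = math.sqrt(dim)
    # Scaled dot-product relevance score for each patch.
    scores = [sum(q * p for q, p in zip(query, patch)) / scale for patch in patches]
    k = max(1, math.ceil(keep_ratio * len(patches)))
    top = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)[:k]
    kept = sorted(top)  # restore spatial order of the retained patches
    # Global context token: mean over all patches (an assumption of this sketch).
    global_token = [sum(patch[d] for patch in patches) / len(patches) for d in range(dim)]
    return kept, [global_token] + [list(patches[i]) for i in kept]
```

In this sketch the number of retained tokens is fixed by `keep_ratio`; the per-frame `Sel` counts mentioned in the Figure 3 caption suggest the actual selector may retain a varying number of tokens per timestep, which this simplification does not model.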