PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention

Ziwen Li, Xin Wang, Hanlue Zhang, Runnan Chen, Runqi Lin, Xiao He, Han Huang, Yandong Guo, Fakhri Karray, Tongliang Liu, Mingming Gong

TL;DR

PosA-VLA tackles unstable and imprecise actions in vision-language-action models by introducing pose-conditioned anchor attention that binds visual perception to the robot’s end-effector pose. It constructs task- and end-effector-centered attention anchors, supervised by a dual loss combining spatial attention and batch-wise contrastive learning, and couples them with a Flow Matching Transformer for efficient, smooth action generation. The method demonstrates superior generalization and efficiency across diverse real-world and simulated manipulation tasks, while remaining lightweight and free of external perception modules. This approach delivers robust, goal-directed manipulation with strong data efficiency and potential for open-vocabulary extensions in embodied AI.
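
The TL;DR names flow matching as the action generator. As a rough illustration of how flow-matching sampling proceeds in general (not the paper's Flow Matching Transformer), the sketch below integrates a velocity field from a noise sample at t = 0 to an action at t = 1 with Euler steps. The function `toy_velocity`, the fixed `target` action, and the step count are hypothetical stand-ins for a learned, observation-conditioned network.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * x0 + t * x1, the velocity
    that carries x_t to x1 is (x1 - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(target, num_steps=10, seed=0):
    """Euler-integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    for k in range(num_steps):
        t = k / num_steps  # t stays below 1, so the division is safe
        v = toy_velocity(x, t, target)
        x = [xi + vi / num_steps for xi, vi in zip(x, v)]
    return x


# Assumed 3-dimensional action, chosen only for illustration.
action = sample_action(target=[0.3, -0.1, 0.5])
```

In an actual policy the velocity would presumably come from a network conditioned on the visual, text, and robot-state features, and the output would be an action sequence rather than a single vector.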

Abstract

Vision-Language-Action (VLA) models have demonstrated remarkable performance on embodied tasks and shown promising potential for real-world applications. However, current VLAs still struggle to produce consistent and precise target-oriented actions, as they often generate redundant or unstable motions along trajectories, limiting their applicability in time-sensitive scenarios. In this work, we attribute these redundant actions to the spatially uniform perception field of existing VLAs, which causes them to be distracted by target-irrelevant objects, especially in complex environments. To address this issue, we propose an efficient PosA-VLA framework that anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. The pose-conditioned anchor attention mechanism enables the model to better align instruction semantics with actionable visual cues, thereby improving action generation precision and efficiency. Moreover, our framework adopts a lightweight architecture and requires no auxiliary perception modules (e.g., segmentation or grounding networks), ensuring efficient inference. Extensive experiments verify that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks and shows robust generalization in a variety of challenging environments.

Paper Structure

This paper contains 31 sections, 17 equations, 16 figures, 7 tables, 1 algorithm.

Figures (16)

  • Figure 1: Quantitative analysis of the grasping task (pick up the bread). Top: the initial scene (left) and the ground-truth grasping moment captured from human teleoperation (right). Bottom: distance between the robot end-effector and the ground-truth grasp point over time; the light-blue area denotes the successful grasping range. Our PosA-VLA reaches the grasping region faster and more accurately, while DexGraspVLA and $\pi_0$ eventually succeed but require longer execution time. In contrast, OpenVLA and Smol-VLA fail to reach the successful grasping range.
  • Figure 2: Overview of the proposed PosA-VLA framework. A CLIP text encoder extracts the textual feature, while a CLIP image encoder produces patch-wise visual features from head and wrist cameras. These features are fused through a cross-attention module to generate anchor attention weights, which are supervised by the proposed anchor loss using the ground-truth pose-conditioned anchor maps. The anchor attention weights are then applied to DINOv2 image features via element-wise multiplication to obtain refined visual representations. Finally, the refined visual features, together with the text feature and the robot state feature, are fed into the Flow Matching Transformer (FMT) to predict the continuous action sequence.
  • Figure 3: Visualization of attention behaviors with and without our anchor supervision. Columns (left to right): input image, pose-anchored attention weight ($\mathbf{M}_t$), average cross-attention of the last-layer heads in the action transformer, and overlay of the action attention on the original image. Top: baseline without anchor loss; bottom: our PosA-VLA with anchor loss, which produces sharper, more localized, and task-centered attention.
  • Figure 4: Experimental setup and evaluation environments. Left: the robotic platform. Right: representative testing environments used in our experiments, including Basic, Unseen Background, Unseen Lighting, Distractor Objects, Unseen Objects, and Long-horizon Task.
  • Figure 5: (a) Illustration of our data collection setup. Objects are randomly placed on a $5{\times}5$ grid, where the red object denotes the target to be grasped, and the yellow dots on the circle indicate the reference points for human placement. (b) Visualization of the overall training distribution obtained by projecting all grasp points from the demonstrations back onto the 2D image plane.
  • ...and 11 more figures
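
The Figure 2 caption describes text and patch features fused by cross-attention into anchor attention weights, which then rescale a second set of image features element-wise. Below is a minimal sketch of that gating step, under several assumptions the caption does not confirm: a single attention head with the text feature as the query, no learned projections, a softmax over patches, and tiny hand-written vectors in place of CLIP and DINOv2 outputs. The function names are hypothetical.

```python
import math


def anchor_weights(text_feat, patch_feats):
    """Softmax over scaled dot products between the text and each patch."""
    scale = math.sqrt(len(text_feat))
    scores = [
        sum(q * k for q, k in zip(text_feat, patch)) / scale
        for patch in patch_feats
    ]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_anchor(weights, dense_feats):
    """Scale every channel of each patch feature by that patch's weight."""
    return [[w * c for c in feat] for w, feat in zip(weights, dense_feats)]


# Toy inputs: 3 patches, 2-dim text-aligned features, 3-dim dense features.
text_feat = [1.0, 0.0]
patch_feats = [[0.1, 0.9], [2.0, 0.0], [0.5, 0.5]]  # stand-in for CLIP patches
dense_feats = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]  # stand-in for DINOv2

weights = anchor_weights(text_feat, patch_feats)
refined = apply_anchor(weights, dense_feats)
```

Here the second patch aligns best with the text vector, so it receives the largest weight and its dense features are attenuated least. The sketch covers only the forward gating; the anchor loss that supervises the weights against pose-conditioned anchor maps is not modeled.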