PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention
Ziwen Li, Xin Wang, Hanlue Zhang, Runnan Chen, Runqi Lin, Xiao He, Han Huang, Yandong Guo, Fakhri Karray, Tongliang Liu, Mingming Gong
TL;DR
PosA-VLA tackles unstable and imprecise actions in vision-language-action models by introducing pose-conditioned anchor attention that binds visual perception to the robot’s end-effector pose. It constructs task- and end-effector-centered attention anchors, supervised by a dual loss combining spatial attention and batch-wise contrastive learning, and couples them with a Flow Matching Transformer for efficient, smooth action generation. The method demonstrates superior generalization and efficiency across diverse real-world and simulated manipulation tasks, while remaining lightweight and free of external perception modules. This approach delivers robust, goal-directed manipulation with strong data efficiency and potential for open-vocabulary extensions in embodied AI.
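As a rough illustration of the dual loss mentioned above, the sketch below combines a spatial attention term with a batch-wise contrastive term. The summary does not give the exact formulation, so everything here is an assumption made for illustration: the Gaussian anchor heatmap, the KL-divergence attention term, the InfoNCE-style contrastive term, the `weight` parameter, and all function names are hypothetical and should not be read as the authors' implementation.

```python
import math

def gaussian_anchor(height, width, cx, cy, sigma=1.0):
    """Hypothetical anchor: a normalized Gaussian heatmap centered at pixel (cx, cy)."""
    weights = [
        math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
        for y in range(height)
        for x in range(width)
    ]
    total = sum(weights)
    return [w / total for w in weights]

def spatial_attention_loss(attn, anchor, eps=1e-8):
    """Assumed spatial term: KL(anchor || attn) over flattened, normalized maps."""
    return sum(a * math.log((a + eps) / (p + eps)) for a, p in zip(anchor, attn))

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _normalize(v):
    norm = math.sqrt(_dot(v, v)) or 1.0
    return [a / norm for a in v]

def batch_contrastive_loss(visual, text, temperature=0.1):
    """Assumed contrastive term: InfoNCE over the batch, where the i-th visual
    feature is the positive for the i-th instruction feature."""
    visual = [_normalize(v) for v in visual]
    text = [_normalize(t) for t in text]
    loss = 0.0
    for i, v in enumerate(visual):
        logits = [_dot(v, t) / temperature for t in text]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(visual)

def dual_loss(attn, anchor, visual, text, weight=0.5):
    """Weighted sum of the two assumed terms; `weight` is a hypothetical knob."""
    return spatial_attention_loss(attn, anchor) + weight * batch_contrastive_loss(visual, text)
```

In this sketch, an attention map that matches the anchor drives the spatial term to zero, while the contrastive term is small only when each visual feature is closest to its own instruction feature within the batch.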
Abstract
Vision-Language-Action (VLA) models have demonstrated remarkable performance on embodied tasks and shown promising potential for real-world applications. However, current VLAs still struggle to produce consistent and precise target-oriented actions: they often generate redundant or unstable motions along trajectories, which limits their applicability in time-sensitive scenarios. In this work, we attribute these redundant actions to the spatially uniform perception field of existing VLAs, which causes them to be distracted by target-irrelevant objects, especially in complex environments. To address this issue, we propose an efficient PosA-VLA framework that anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. The pose-conditioned anchor attention mechanism enables the model to better align instruction semantics with actionable visual cues, thereby improving the precision and efficiency of action generation. Moreover, our framework adopts a lightweight architecture and requires no auxiliary perception modules (e.g., segmentation or grounding networks), ensuring efficient inference. Extensive experiments verify that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks and generalizes robustly in a variety of challenging environments.
