AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

Lei Xiao, Jifeng Li, Juntao Gao, Feiyang Ye, Yan Jin, Jingjing Qian, Jing Zhang, Yong Wu, Xiaoyuan Yu

TL;DR

This paper reframes Vision-Language-Action (VLA) models from an MDP to a POMDP perspective by introducing a recurrent belief state and an Active Visual Attention (AVA) module that dynamically weights visual tokens based on history. The AVA-VLA framework, built on an OpenVLA-OFT foundation, conditions action generation on both the current observation and the learned recurrent state, enabling active perception across time. Empirical results show state-of-the-art performance on the LIBERO and CALVIN benchmarks and robust sim-to-real transfer on a real dual-arm robot, with extensive ablations validating the contributions of the AVA module and the recurrent initialization. These findings highlight the practical benefits of history-aware visual processing for complex, long-horizon robotic manipulation tasks.

Abstract

Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep. This approach implicitly models the task as a Markov Decision Process (MDP). Such a history-agnostic design is suboptimal for visual token processing in dynamic sequential decision-making, as it fails to leverage historical context. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Inspired by the POMDP principle that action generation should be conditioned on the belief state, AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, a neural approximation of the agent's belief state derived from the previous decision step. Specifically, the AVA module uses the recurrent state to compute soft weights that emphasize task-relevant visual tokens based on historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework's practical applicability and robust sim-to-real transferability.
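
To make the idea of history-conditioned soft weights concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a dot-product scoring function between a projected recurrent state and projected visual tokens, followed by a sigmoid; the abstract does not specify either choice. The function names (`soft_token_weights`, `modulate`), the projection matrices `w_query` and `w_key`, and all dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


def soft_token_weights(visual_tokens, recurrent_state, w_query, w_key):
    """Return one soft weight in (0, 1) per visual token.

    Assumption: scaled dot-product scoring plus a sigmoid. The paper's
    actual AVA module may use a different scoring function.
    """
    query = recurrent_state @ w_query              # shape: (d_k,)
    keys = visual_tokens @ w_key                   # shape: (n_tokens, d_k)
    scores = keys @ query / np.sqrt(keys.shape[-1])
    return 1.0 / (1.0 + np.exp(-scores))           # element-wise sigmoid


def modulate(visual_tokens, weights):
    """Scale each token embedding by its soft weight."""
    return visual_tokens * weights[:, None]


# Toy inputs; every size below is a hypothetical placeholder.
rng = np.random.default_rng(0)
n_tokens, d_model, d_state, d_k = 6, 8, 4, 5
visual_tokens = rng.normal(size=(n_tokens, d_model))   # current observation
recurrent_state = rng.normal(size=(d_state,))          # from the previous step
w_query = rng.normal(size=(d_state, d_k))
w_key = rng.normal(size=(d_model, d_k))

weights = soft_token_weights(visual_tokens, recurrent_state, w_query, w_key)
modulated_tokens = modulate(visual_tokens, weights)
```

Because the weights depend on `recurrent_state`, the same image can be weighted differently at different points in an episode, which is the history-aware behavior the abstract contrasts with per-timestep, history-agnostic processing.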

Paper Structure

This paper contains 19 sections, 14 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: (a) Graphical models of the proposed AVA-VLA framework and vanilla VLAs. (b) The corresponding visualization from two viewpoints when executing "turn on the stove and put the moka pot on it", where the vanilla OpenVLA-OFT fails to locate the "stove" switch according to the time sequence.
  • Figure 2: Overview of the proposed AVA-VLA framework.
  • Figure 3: Comparison on the Mobile ALOHA real-robot experiments. Evaluation across four manipulation tasks, including (a) Pick and Place, (b) Sequenced Instruction Understanding, (c) Flexible Object Folding, (d) Dexterous Action. Left: Representative middle states for each task setup. Right: Task-specific success rates and cross-task averages for our method and baselines.
  • Figure 4: Visualization of the proposed AVA-VLA's manipulation process on four long-horizon real-world tasks, showing observations at key execution stages.
  • Figure 5: Visual dynamics of the soft weights during the task "put both moka pots on the stove" from two viewpoints.
  • ...and 5 more figures