BFA++: Hierarchical Best-Feature-Aware Token Prune for Multi-View Vision Language Action Model

Haosheng Li, Weixin Mao, Zihan Lan, Hongwei Xiong, Hongan Wang, Chenyang Si, Ziwei Liu, Xiaoming Deng, Hua Chen

TL;DR

BFA++ is a dynamic token pruning framework designed specifically for VLA models. It shows that context-sensitive, task-aware token pruning is a more effective strategy than full visual processing, enabling faster inference and improved manipulation accuracy in real-world robotic systems.

Abstract

Vision-Language-Action (VLA) models have achieved significant breakthroughs by leveraging Large Vision Language Models (VLMs) to jointly interpret instructions and visual inputs. However, the substantial increase in visual tokens, particularly from multi-view inputs, poses serious challenges to real-time robotic manipulation. Existing acceleration techniques for VLMs, such as token pruning, often degrade performance when applied directly to VLA models, because they overlook the relationships between different views and fail to account for the dynamic, task-specific characteristics of robotic operation. To address this, we propose BFA++, a dynamic token pruning framework designed specifically for VLA models. BFA++ introduces a hierarchical pruning strategy guided by two-level importance predictors: an intra-view predictor highlights task-relevant regions within each image to suppress spatial noise, while an inter-view predictor identifies critical camera views across different manipulation phases to reduce cross-view redundancy. This design enables efficient token selection while preserving essential visual cues, resulting in improved computational efficiency and higher manipulation success rates. Evaluations on the RoboTwin benchmark and real-world robotic tasks demonstrate that BFA++ consistently outperforms existing methods. BFA++ improves the success rate by about 10% on both the π0 and RDT models while achieving speedups of 1.8× and 1.5×, respectively. Our results highlight that context-sensitive and task-aware token pruning is a more effective strategy than full visual processing, enabling faster inference and improved manipulation accuracy in real-world robotic systems.

Paper Structure (14 sections, 5 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: The comparison between BFA++ and other methods. Compared to existing token pruning methods, our approach not only improves inference speed but also significantly increases the success rate.
  • Figure 2: The pipeline of BFA++. The left figure shows our plug-and-play module for token pruning, and the right figure details it further. After obtaining the offline-annotated inter-view importance $S^{gt}_{inter}$ and intra-view importance $S^{gt}_{intra}$, we train the two importance predictors, which predict the two importance scores ($S_{inter}$, $S_{intra}$). Based on the predicted intra-view importance, we perform local pruning on the tokens. Then, based on both inter-view and intra-view importance, we conduct global pruning to obtain the final pruned tokens.
  • Figure 3: Analyzing the importance of different viewpoints. We remove one or two view images to see how the performance of a trained $\pi_0$ model is affected. This experiment demonstrates that the importance of different viewpoints varies across different stages of operation.
  • Figure 4: The annotation system of BFA++. For inter-view importance, we provide three optional methods (VLM annotation, manual annotation, and bounding-box overlap detection). For intra-view importance, we use the Grounding-SAM method to identify task-related regions. Details of the VLM annotation method can be found in BFA.
  • Figure 5: The local and global token pruning in $\pi_0$+BFA++.
  • ...and 5 more figures
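
The two-stage procedure described in the Figure 2 caption (local pruning from intra-view importance, then global pruning from both scores) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the per-view keep ratio, the global token budget, and the rule that combines the two scores (a simple product) are all hypothetical, since this page does not specify them.

```python
from typing import Dict, List


def local_prune(intra_scores: List[float], keep_ratio: float) -> List[int]:
    """Local pruning: keep the highest-scoring tokens within one view.

    `keep_ratio` is a hypothetical hyperparameter (fraction of tokens kept).
    Returns the kept token indices in their original order.
    """
    k = max(1, round(len(intra_scores) * keep_ratio))
    ranked = sorted(range(len(intra_scores)),
                    key=lambda i: intra_scores[i], reverse=True)
    return sorted(ranked[:k])


def hierarchical_prune(intra: Dict[str, List[float]],
                       inter: Dict[str, float],
                       keep_ratio: float,
                       budget: int) -> Dict[str, List[int]]:
    """Local pruning per view, then global pruning across all views.

    intra:  per-view list of predicted intra-view token scores (S_intra).
    inter:  per-view predicted inter-view score (S_inter).
    budget: hypothetical total number of tokens kept after global pruning.
    """
    candidates = []
    for view, scores in intra.items():
        for idx in local_prune(scores, keep_ratio):
            # Assumed combination rule: product of the two importance scores.
            candidates.append((inter[view] * scores[idx], view, idx))

    # Global pruning: keep the `budget` best tokens across all views.
    candidates.sort(key=lambda c: c[0], reverse=True)
    kept: Dict[str, List[int]] = {view: [] for view in intra}
    for _, view, idx in candidates[:budget]:
        kept[view].append(idx)
    return {view: sorted(ids) for view, ids in kept.items()}


# Toy example: the wrist camera matters more in this (made-up) phase.
intra_scores = {"head": [0.9, 0.1, 0.8, 0.2], "wrist": [0.7, 0.6, 0.1, 0.3]}
inter_scores = {"head": 0.2, "wrist": 0.9}
print(hierarchical_prune(intra_scores, inter_scores, keep_ratio=0.5, budget=3))
# prints {'head': [0], 'wrist': [0, 1]}
```

In this toy case, local pruning keeps two tokens per view, and global pruning then allocates most of the remaining budget to the view with the higher inter-view score, which mirrors the idea that view importance changes across manipulation phases.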