Compressor-VLA: Instruction-Guided Visual Token Compression for Efficient Robotic Manipulation

Juntao Gao, Feiyang Ye, Jing Zhang, Wenjing Qian

TL;DR

This work tackles the high computational burden of visual tokens in Vision-Language-Action (VLA) models for robotic manipulation by introducing Compressor-VLA, a hybrid, instruction-guided token compression framework. It jointly employs a Semantic Task Compressor (STC) for global, task-relevant context and a Spatial Refinement Compressor (SRC) for preserving fine-grained spatial details, with language instructions modulating both pathways via FiLM-like mechanisms. The approach yields a substantial efficiency gain, reducing FLOPs by $59\%$ and token counts by over $3\times$, while maintaining competitive task success on LIBERO and demonstrating strong sim-to-real transfer on a dual-arm robot; qualitative analyses corroborate instruction-driven perceptual focusing and complementary global-local processing. These results suggest that task-conditioned compression of visual tokens can enable more efficient, robust, and responsive robotic systems by aligning perceptual filtering with task goals. The findings are supported by extensive ablations and real-world experiments, underscoring the practical value of instruction-guided, dual-pathway compression in VLA models.

Abstract

Vision-Language-Action (VLA) models have emerged as a powerful paradigm in Embodied AI. However, the significant computational overhead of processing redundant visual tokens remains a critical bottleneck for real-time robotic deployment. Standard token pruning techniques can alleviate this overhead, but because they are task-agnostic they struggle to preserve task-critical visual information. To address the challenge of preserving both holistic context and the fine-grained details needed for precise action, we propose Compressor-VLA, a novel hybrid instruction-conditioned token compression framework designed for efficient, task-oriented compression of visual information in VLA models. The framework consists of two token compression modules: a Semantic Task Compressor (STC) that distills holistic, task-relevant context, and a Spatial Refinement Compressor (SRC) that preserves fine-grained spatial details. The compression is dynamically modulated by the natural language instruction, allowing adaptive condensation of task-relevant visual information. Extensive evaluations show that Compressor-VLA achieves a competitive success rate on the LIBERO benchmark while reducing FLOPs by 59% and the visual token count by over 3× compared to its baseline. Real-robot deployments on a dual-arm platform validate the model's sim-to-real transferability and practical applicability. Moreover, qualitative analyses reveal that the instruction guidance dynamically steers the model's perceptual focus toward task-relevant objects, supporting the effectiveness of the approach.

Paper Structure

This paper contains 27 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of three different visual information processing pipelines. (a) The standard VLA framework processes all visual tokens from the encoder. (b) Prior work, such as pruning-based methods, discards low-scoring tokens in a task-agnostic manner. (c) The proposed Compressor-VLA model performs instruction-guided compression via a dual-mechanism approach, reconstructing a compact set of tokens.
  • Figure 2: The architecture of the proposed Compressor-VLA. The module features two instruction-guided parallel pathways. The Semantic Task Compressor (STC) uses language to modulate its queries, while the Spatial Refinement Compressor (SRC) infuses language information directly into local visual tokens.
  • Figure 3: Execution examples on real-world tasks.
  • Figure 4: Instruction-conditioned attention visualization. The STC module's attention for the same initial scene under different language commands. Left: "put both the alphabet soup and the tomato sauce...". Right: "put both the cream cheese box and the butter...".
  • Figure 5: Visualization of the hybrid architecture's synergy across multiple tasks in LIBERO-10, including Task 5 ("put the white mug on the left plate..."), Task 7 ("put the white mug on the plate..."), and Task 10 ("put the yellow and white mug in the microwave...").
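
The two pathways described in the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the 1D token windows, the mean pooling in the spatial pathway, and all sizes are assumptions chosen for readability, and the FiLM parameters `gamma` and `beta` are random stand-ins for values that a real model would predict from the instruction embedding.

```python
import math
import random

def film(rows, gamma, beta):
    """FiLM-style modulation: feature-wise gamma * x + beta on every row."""
    return [[g * v + b for v, g, b in zip(row, gamma, beta)] for row in rows]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, tokens):
    """Each query returns a softmax-weighted average of all tokens."""
    d = len(tokens[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(d) for t in tokens]
        w = softmax(scores)
        out.append([sum(wi * t[j] for wi, t in zip(w, tokens)) for j in range(d)])
    return out

def semantic_path(tokens, queries, gamma, beta):
    """STC-like pathway (assumed form): language-modulated queries
    attend over the full token set to summarize global context."""
    return cross_attend(film(queries, gamma, beta), tokens)

def spatial_path(tokens, gamma, beta, window):
    """SRC-like pathway (assumed form): language is infused into the
    visual tokens, then each local window is mean-pooled to one token."""
    modulated = film(tokens, gamma, beta)
    d = len(tokens[0])
    out = []
    for start in range(0, len(modulated), window):
        chunk = modulated[start:start + window]
        out.append([sum(t[j] for t in chunk) / len(chunk) for j in range(d)])
    return out

def compress(tokens, queries, gamma, beta, window):
    """Concatenate the global and local token sets into one compact set."""
    return (semantic_path(tokens, queries, gamma, beta)
            + spatial_path(tokens, gamma, beta, window))

# Illustrative sizes only; none of these values come from the paper.
rng = random.Random(0)
dim, n_tokens, n_queries, window = 8, 64, 4, 4
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_queries)]
gamma = [1.0 + 0.1 * rng.gauss(0, 1) for _ in range(dim)]  # stand-in for f(instruction)
beta = [0.1 * rng.gauss(0, 1) for _ in range(dim)]         # stand-in for g(instruction)

compressed = compress(tokens, queries, gamma, beta, window)
print(len(tokens), "->", len(compressed))  # prints "64 -> 20"
```

In this sketch the output length is fixed by the number of queries plus the number of windows, so the downstream model sees the same compact token budget regardless of the instruction. Only the content of those tokens changes with the language conditioning.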