The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning

Titong Jiang, Xuefeng Jiang, Yuan Ma, Xin Wen, Bailin Li, Kun Zhan, Peng Jia, Yahui Liu, Sheng Sun, Xianpeng Lang

TL;DR

The paper addresses the high computational cost of vision-language-action (VLA) models by introducing LightVLA, a differentiable token pruning framework that adaptively selects informative visual tokens using parameter-free cross-attention queries and Gumbel-softmax. This approach reduces compute without sacrificing performance, achieving state-of-the-art results on the LIBERO benchmark with FLOPs and latency reduced by 59.1% and 38.2%, respectively, and a 2.6% success-rate (SR) improvement over the strong OpenVLA-OFT baseline. It also explores LightVLA*, a variant with learnable queries that can further enhance performance, particularly on long-horizon tasks. The work demonstrates that efficiency and performance in VLA systems can be jointly optimized, enabling more practical real-time robotic applications and guiding future research on scalable embodied AI.

Abstract

We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: it generates dynamic queries to evaluate visual token importance, and adopts Gumbel-softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning tokens that do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic magic numbers and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms different VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2%, respectively, with a 2.6% improvement in task success rate. Meanwhile, we also investigate the learnable query-based token pruning method LightVLA* with additional trainable parameters, which also achieves satisfactory performance. Our work reveals that as VLA pursues optimal performance, LightVLA spontaneously learns to prune tokens from a performance-driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the dual goals of efficiency and performance, marking a significant step toward more efficient, powerful and practical real-time robotic systems.
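
The selection step described in the abstract (queries score visual tokens, Gumbel-softmax makes the choice trainable) can be illustrated with a small plain-Python sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `select_tokens` and `gumbel_noise`, the dot-product scoring, the temperature `tau`, and the rule that each query votes for one token and the union of votes is kept are all hypothetical choices made here for clarity.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_noise(rng):
    """Draw one Gumbel(0, 1) sample via the inverse-CDF trick."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u away from 0 and 1
    return -math.log(-math.log(u))


def select_tokens(queries, tokens, tau=1.0, rng=None, training=True):
    """Score every token with every query and keep the tokens that win a vote.

    Assumption: each query keeps exactly one token (its argmax), and a token
    survives if at least one query picked it. During training, Gumbel noise is
    added to the scores so the choice is stochastic; at inference the scores
    are used as-is.
    """
    rng = rng or random.Random(0)
    kept = set()
    soft_weights = []  # relaxed (soft) selection weights, one row per query
    for q in queries:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) for t in tokens]
        if training:
            scores = [(s + gumbel_noise(rng)) / tau for s in scores]
        probs = softmax(scores)
        soft_weights.append(probs)
        kept.add(max(range(len(tokens)), key=probs.__getitem__))  # hard pick
    return sorted(kept), soft_weights


# Toy example: three 2-D "visual tokens" and three queries (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1]]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
kept, _ = select_tokens(queries, tokens, training=False)
# kept == [0, 1]: token 2 never wins a vote, so it is pruned.
```

Note that the number of kept tokens is not fixed in advance in this sketch: it falls out of how many distinct tokens the queries pick, which is one way to avoid a hand-set pruning ratio. In an autodiff framework, the hard pick would typically be combined with the soft weights through a straight-through estimator so that gradients reach the scores; plain Python cannot show that step.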

Paper Structure

This paper contains 17 sections, 7 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: LightVLA achieves better performance than common VLA models and acceleration methods with fewer visual tokens, yielding efficient computation and lower latency.
  • Figure 2: Illustration of the proposed LightVLA framework. Gray regions indicate the use of Gumbel-softmax for differentiable token selection.
  • Figure 3: An example LIBERO-Long task: 'Put both moka pots on the stove'. Each frame consists of 4 images. Upper left: the third-person camera view. Upper right: the wrist camera view. Lower left: the third-person camera view with pruned tokens masked. Lower right: the wrist camera view with pruned tokens masked.
  • Figure 4: Illustration of LightVLA* when pruning visual tokens at the vision encoder with the learnable query.
  • Figure 5: Illustration of LightVLA* when pruning visual tokens at the first decoder layer of the LLM with the learnable query.