Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee

Abstract

Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.

Paper Structure (25 sections, 5 equations, 8 figures, 8 tables, 1 algorithm)


Figures (8)

  • Figure 1: (Left) Token pruning with our STTS (purple box) vs. a cosine-similarity-based heuristic. STTS learns that background patches are less important, while the heuristic prunes all tokens equally. (Right) QA performance under increasing vision token pruning ratios ($k\%$). STTS (pink squares) consistently demonstrates a flatter, more robust degradation curve compared to the Random baseline (blue circles) across all metrics.
  • Figure 2: Overall workflow of using STTS within the VLM. Numbered vision tokens here are 3x3 grids. After ViT layer $l$, STTS prunes vision tokens permanently from the entire architecture. We pad tokens during packing for ViT batch computation.
  • Figure 2: Comparison between different pruning methods using 50% pruning. With Random as the baseline, STTS outperforms Heuristic, especially on long videos.
  • Figure 3: Architectural and procedural overview of STTS. We use 9x9 tokens per frame for illustration. Vision features after ViT layer $l$ are first downsampled via pooling then scored. The scores are injected as attention bias for layer $l+1$ before the pruning algorithm is applied to allow for spatial pruning. The scores are also aligned with neighboring-frame per-patch cosine similarity for temporal pruning.
  • Figure 4: Visualization of the packing algorithm. (a) Before pruning, the scoring mechanism identifies the bottom-$k\%$ importance tokens (in dotted squares) to be removed. (b) To reduce tensor sparsity, the remaining tokens from Frame 2 (green) and Frame 4 (red) are consolidated into a single packed batch entry. Because Frame 1 is always untouched and Frame 3 retains high token counts, they remain independent.
  • ...and 3 more figures
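
The captions for Figures 3 and 4 mention three mechanical steps: a neighboring-frame per-patch cosine similarity, removal of the bottom-$k\%$ tokens with Frame 1 left untouched, and consolidation of sparsely populated frames into shared batch entries. The sketch below illustrates these steps in plain Python under stated assumptions; it is not the authors' implementation. The function names (`cosine`, `temporal_similarity`, `prune_mask`, `pack_frames`) are hypothetical, the scores are taken as given rather than produced by the learned STTS scorer, the pruning budget is assumed to be a fraction of all tokens, and first-fit-decreasing packing is a stand-in for the paper's packing algorithm, whose details are not given in this excerpt.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def temporal_similarity(frames):
    """Per-patch similarity to the same patch in the previous frame.

    frames: list of frames, each a list of patch feature vectors.
    The first frame has no predecessor, so its row is all 0.0 (assumption).
    """
    sims = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        sims.append([cosine(p, c) for p, c in zip(prev, cur)])
    return sims


def prune_mask(scores, ratio):
    """Keep-mask that drops the lowest-scoring tokens, never touching frame 0.

    scores: per-frame lists of importance scores (higher = more important).
    ratio:  fraction of ALL tokens to drop (assumption about the budget).
    """
    candidates = sorted(
        (s, f, p)
        for f, row in enumerate(scores) if f > 0
        for p, s in enumerate(row)
    )
    total = sum(len(row) for row in scores)
    n_drop = min(int(ratio * total), len(candidates))
    dropped = {(f, p) for _, f, p in candidates[:n_drop]}
    return [[(f, p) not in dropped for p in range(len(row))]
            for f, row in enumerate(scores)]


def pack_frames(kept_counts, capacity):
    """First-fit-decreasing packing of frames into batch entries.

    kept_counts: number of surviving tokens per frame.
    capacity:    token slots per batch entry (tokens per unpruned frame).
    Returns a list of entries, each a sorted list of frame indices.
    """
    entries = []  # each item: [remaining_slots, [frame indices]]
    order = sorted(range(len(kept_counts)), key=lambda i: -kept_counts[i])
    for f in order:
        for entry in entries:
            if entry[0] >= kept_counts[f]:
                entry[0] -= kept_counts[f]
                entry[1].append(f)
                break
        else:
            entries.append([capacity - kept_counts[f], [f]])
    return [sorted(entry[1]) for entry in entries]
```

As a usage illustration with made-up numbers, four frames of nine tokens whose surviving counts are `[9, 4, 7, 5]` pack into three entries, `[[0], [2], [1, 3]]`: the untouched first frame and the mostly intact third frame stay on their own, while the second and fourth frames share one entry, matching the situation described in the Figure 4 caption. Any slots left over in an entry would be filled with padding for batched ViT computation.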