ShaRP: SHAllow-LayeR Pruning for Video Large Language Models Acceleration

Yingjie Xia, Tao Liu, Jinglei Shi, Qingsong Xie, Heng Guo, Jian Yang, Xi Wang

TL;DR

The paper tackles the high computational cost of the pre-filling stage in Video LLMs (VLLMs) by introducing ShaRP, a training-free token pruning framework that operates at shallow decoder layers. ShaRP combines segment-aware causal masking, positional bias calibration, and register token deduplication to counter attention collapse, positional encoding (PE) bias, and token redundancy, enabling aggressive compression without retraining. It performs strongly across multiple video benchmarks and backbones, delivering substantial speedups while maintaining accuracy and remaining compatible with existing pruning methods. The result is a practical approach to efficient VLLM inference for long-form video understanding.

Abstract

Video Large Language Models (VLLMs) face a high computational load during the pre-filling stage because they must process an enormous number of visual tokens. Although attention-based pruning methods are widely used to accelerate inference, applying them at early decoder layers often results in significant performance degradation, especially under high compression rates. We argue that while attention-based pruning inherently holds the potential to identify the most relevant visual tokens, its effectiveness in shallow decoder layers is limited by factors such as positional encoding bias and insufficient information interaction. In this paper, we propose an improved attention-based pruning framework, termed ShaRP, that integrates segment-aware causal masking, positional debiasing, and token deduplication for enhanced token selection. It enables effective pruning at shallow layers while maintaining stable performance under high compression rates without retraining. Extensive experiments demonstrate that ShaRP achieves competitive performance across multiple video understanding benchmarks, establishing a new paradigm for accelerating VLLM inference.
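To make the idea of attention-based pruning concrete, the following is a minimal sketch of the generic technique the abstract refers to: rank visual tokens by an attention score and keep only the top fraction. It is not ShaRP itself. The function name `prune_by_attention`, the `retention` parameter, and the example scores are hypothetical, and the scores are assumed to be one value per visual token (for instance, attention from the last text token averaged over heads).

```python
def prune_by_attention(scores, retention=0.2):
    """Return indices of visual tokens to keep, in their original order.

    scores:    one attention score per visual token (assumed input).
    retention: fraction of tokens to keep, in (0, 1].
    """
    if not 0.0 < retention <= 1.0:
        raise ValueError("retention must be in (0, 1]")
    keep = max(1, round(len(scores) * retention))
    # Rank token indices by score, highest first; ties favor the earlier token.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Restore temporal order so the surviving tokens stay sequence-consistent.
    return sorted(ranked[:keep])


# Hypothetical scores for ten visual tokens.
scores = [0.01, 0.30, 0.02, 0.05, 0.25, 0.01, 0.04, 0.02, 0.20, 0.10]
kept = prune_by_attention(scores, retention=0.3)
print(kept)  # [1, 4, 8]
```

A selection rule like this is only as good as the scores it ranks, which is why biased shallow-layer attention makes early pruning unreliable.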

Paper Structure

This paper contains 14 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Left: ShaRP is a training-free, attention-based framework for inner-LMM token pruning in VLLMs. Right: ShaRP delivers superior performance across video understanding benchmarks.
  • Figure 2: Layer-wise Attention Scores. Shallow-layer attention exhibits strong positional bias, assigning higher scores to later tokens, whereas deep-layer attention is semantically guided. After debiasing, shallow-layer attention closely aligns with that of deep layers, enabling early and reliable pruning.
  • Figure 3: Pipeline of ShaRP. ShaRP performs token pruning in video LLMs through three sequential modules: (a) Segment-Aware Causal Masking (SegM) partitions video tokens into content-consistent segments and restricts attention within them; (b) Positional Bias Calibration (PosC) debiases shallow-layer attention scores, aligning them with semantically meaningful distributions; (c) Register Token Deduplication (RegD) refines pruned tokens by merging redundant ones and replenishing diverse representatives.
  • Figure 4: Original vs. Debiased Attention. Left: Averaged attention scores on VideoMME [fu2025video] at 20% token retention. After debiasing, attention is more evenly distributed across frames, aligning better with question-relevant regions. Right: Cross-attention between the last text token and video tokens. In the original setting, semantically important middle frames receive low attention scores, while after debiasing, attention correctly focuses on relevant segments with high scores.
  • Figure 5: Comparison and Ablation Visualization. Left: Comparison between ShaRP and prior attention-based pruning methods FastV [chen2024image] and Feather [endo2025feather]. Right: Visualization of ablation results illustrating the effect of each proposed component.
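The captions for Figures 2 and 3 describe two ideas that a toy example can illustrate: removing a position-dependent trend from shallow-layer attention scores, and dropping near-duplicate tokens. The sketch below is an illustration under stated assumptions, not the paper's PosC or RegD formulation. Fitting a linear trend over token position and using a greedy cosine-similarity filter are stand-in techniques chosen for simplicity, and all names (`remove_linear_trend`, `deduplicate`, `threshold`) are hypothetical. The merging and replenishing steps mentioned for RegD are not modeled.

```python
import math


def remove_linear_trend(scores):
    """Subtract a least-squares linear trend over token position.

    Assumption: the positional bias is roughly linear in position.
    """
    n = len(scores)
    if n < 2:
        return list(scores)
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    var_x = sum((i - mean_x) ** 2 for i in range(n))
    cov = sum((i - mean_x) * (s - mean_y) for i, s in enumerate(scores))
    slope = cov / var_x
    return [s - slope * (i - mean_x) for i, s in enumerate(scores)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(features, order, threshold=0.9):
    """Greedily keep tokens (visited in `order`) that are not near-duplicates
    of an already kept token. This drops duplicates rather than merging them."""
    kept = []
    for i in order:
        if all(cosine(features[i], features[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical scores: a rising positional trend plus a relevant token at index 1.
raw = [0.0, 0.35, 0.2, 0.3, 0.4]
debiased = remove_linear_trend(raw)
best_raw = max(range(len(raw)), key=lambda i: raw[i])
best_debiased = max(range(len(debiased)), key=lambda i: debiased[i])
print(best_raw, best_debiased)  # 4 1

# Hypothetical 2-D token features; tokens 0 and 1 are near-duplicates.
features = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
print(deduplicate(features, order=[0, 1, 2]))  # [0, 2]
```

In this toy input the raw scores favor the last token purely because of its position, while the detrended scores recover the token that stands out relative to its neighbors.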