V-CAST: Video Curvature-Aware Spatio-Temporal Pruning for Efficient Video Large Language Models

Xinying Lin, Xuyang Liu, Yiyu Wang, Teng Ma, Wenqi Ren

Abstract

Video large language models (VideoLLMs) show strong capabilities in video understanding, yet their long-context inference is still dominated by the massive number of redundant visual tokens processed in the prefill stage. We revisit token compression for VideoLLMs under tight budgets and identify a key bottleneck: insufficient spatio-temporal information coverage. Existing methods often introduce discontinuous coverage through coarse per-frame allocation or scene segmentation, and token merging can further misalign spatio-temporal coordinates under MRoPE-style discrete (t, h, w) bindings. To address these issues, we propose V-CAST (Video Curvature-Aware Spatio-Temporal Pruning), a training-free, plug-and-play pruning policy for long-context video inference. V-CAST casts token compression as a trajectory-approximation problem and introduces a curvature-guided temporal allocation module that routes per-frame token budgets to semantic turns and event boundaries. It further adopts a dual-anchor spatial selection mechanism that preserves high-entropy visual evidence without attention intervention, while keeping retained tokens at their original coordinates to maintain positional alignment. Extensive experiments across multiple VideoLLMs of different architectures and scales demonstrate that V-CAST retains 98.6% of the original performance, outperforms the second-best method by +1.1% on average, and reduces peak memory and total latency to 86.7% and 86.4%, respectively, of vanilla Qwen3-VL-8B-Instruct.
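
The exact curvature measure is defined in the main text; the temporal-allocation idea can nonetheless be illustrated with a minimal sketch. Assuming per-frame features (e.g., pooled vision-encoder embeddings) and using the turning angle between consecutive embedding deltas as a stand-in for semantic curvature, the hypothetical helper below splits a global token budget across frames. All function and variable names are ours, not the paper's.

```python
import numpy as np

def curvature_budgets(frame_feats: np.ndarray, total_budget: int) -> np.ndarray:
    """Split a global token budget across T frames by a curvature-like score.

    frame_feats: (T, D) per-frame features, e.g. pooled vision-encoder
    embeddings. The turning angle between consecutive embedding deltas is a
    hypothetical proxy for the paper's curvature measure: sharp semantic
    turns and event boundaries score high and receive larger budgets.
    """
    deltas = np.diff(frame_feats, axis=0)                         # (T-1, D)
    unit = deltas / (np.linalg.norm(deltas, axis=1, keepdims=True) + 1e-8)
    cos = np.clip((unit[1:] * unit[:-1]).sum(axis=1), -1.0, 1.0)
    angle = np.arccos(cos)                                        # (T-2,) turning angles
    score = np.ones(frame_feats.shape[0])                         # baseline for endpoints
    score[1:-1] += angle                                          # boost semantic turns
    # Largest-remainder rounding so per-frame budgets sum exactly to total_budget.
    raw = total_budget * score / score.sum()
    budget = np.floor(raw).astype(int)
    leftovers = np.argsort(raw - budget)[::-1]
    budget[leftovers[: total_budget - budget.sum()]] += 1
    return budget

# Example: distribute 1,024 visual tokens over 16 sampled frames.
budgets = curvature_budgets(np.random.randn(16, 768), total_budget=1024)
```

Unlike uniform allocation, frames at semantic turns receive proportionally more of the budget, matching the behavior the paper sketches in Figure 2.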

Paper Structure

This paper contains 18 sections, 10 equations, 9 figures, 11 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Comparisons of different spatial selection strategies at $R=25\%$.
  • Figure 2: Spatio-temporal budgeting profiles. We visualize per-frame token budgets (higher means more retained tokens) on representative short, medium, and long videos. Uniform allocation [yang2025visionzip] stays nearly flat and misses brief evidence peaks; segment-based pipelines [shen2025fastvid, shao2025holitom] introduce boundary-sensitive budget jumps and waste tokens within redundant windows; global-uniqueness budgeting [liu2025vidcom2] can over-concentrate on a few globally distinctive frames and under-cover transitional segments. Curvature-aware budgeting allocates more tokens to rapid semantic changes and event boundaries, improving spatio-temporal information coverage under tight budgets.
  • Figure 4: Overview of V-CAST. V-CAST formulates token pruning for VideoLLMs as an optimal semantic-trajectory approximation problem under a fixed budget. It applies Curvature-Guided Temporal Allocation to assign per-frame budgets by tracking semantic transitions, and then performs Dual-Anchor Spatial Token Selection to retain diverse and salient tokens within each frame while preserving their original on-grid coordinates (a minimal sketch of this selection step follows the figure list).
  • Figure 5: Consistent gains with more frames. Performance trends on LongVideoBench, MLVU, VideoMME (Long), and EgoSchema as input frames increase. V-CAST improves accuracy and scales to longer inputs, while some baselines show limited gains or OOM failures at larger frame counts.
  • Figure 6: Efficiency comparison on LLaVA-OneVision-7B. We compare Vanilla, FastVID, VidCom$^2$, and V-CAST on Prefill Latency, Total Latency, and Peak Memory.
  • ...and 4 more figures
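
The dual-anchor spatial selection named in the Figure 4 caption is not specified in detail on this page, so the sketch below is only one plausible reading: a saliency anchor that keeps high-feature-norm tokens (a cheap proxy for the high-entropy evidence mentioned in the abstract) plus a diversity anchor based on farthest-point sampling, returning indices rather than merged features so each retained token keeps its original (t, h, w) coordinate. All names are hypothetical.

```python
import numpy as np

def dual_anchor_select(tokens: np.ndarray, k: int) -> np.ndarray:
    """Pick k token indices in one frame via two complementary anchors.

    tokens: (N, D) token features for a single frame; assumes k <= N.
    Hypothetical reading of "dual-anchor": half the budget goes to a
    saliency anchor (highest feature norm), the rest to a diversity
    anchor (greedy farthest-point sampling seeded by the salient set).
    """
    n = tokens.shape[0]
    salient = np.argsort(np.linalg.norm(tokens, axis=1))[::-1][: k // 2]
    chosen = list(salient)
    dist = np.full(n, np.inf)                  # distance to nearest chosen token
    for c in chosen:
        dist = np.minimum(dist, np.linalg.norm(tokens - tokens[c], axis=1))
    while len(chosen) < k:                     # farthest-point sampling
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(tokens - tokens[nxt], axis=1))
    return np.sort(np.asarray(chosen))         # sorted indices into the token grid
```

Selecting indices rather than merging features is what keeps the retained tokens' discrete (t, h, w) bindings intact, which the abstract identifies as the failure mode of merging-based methods under MRoPE-style positional encodings.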