Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding

Jiaqi Li, Shuntian Zheng, Yixian Shen, Jia-Hong Huang, Xiaoman Lu, Minzhe Ni, Yu Guan

TL;DR

SemVID is a training-free pruning framework that constructs a compact yet coherent token subset with complementary semantic roles, achieving a strong accuracy-efficiency trade-off on VTG benchmarks.

Abstract

Video Temporal Grounding (VTG) localizes the temporal boundaries of a query-relevant moment in long, untrimmed videos, which makes video-language-model (VLM) pipelines prohibitively expensive. While recent training-free visual token pruning has shown success in video question answering, naively applying these objectives to VTG often causes drastic degradation, because VTG depends crucially on boundary-sensitive evidence and cross-frame reasoning chains. We therefore identify two VTG-specific pruning principles: Evidence Retention (ER), which keeps query-critical patches, especially around event boundaries, and Connectivity Strength (CS), which preserves token-level cross-frame connectivity for long-range evidence aggregation. Building on these insights, we propose SemVID, a training-free pruning framework that constructs a compact yet coherent token subset with complementary semantic roles. SemVID first allocates per-frame token budgets by balancing query relevance and inter-frame variation to avoid over-pruned segments, and then selects three types of tokens: object tokens for diverse query-critical evidence, motion tokens that capture meaningful transitions and serve as cross-frame relays, and a small set of context tokens for scene continuity. Extensive experiments on VTG benchmarks show that SemVID achieves a strong accuracy-efficiency trade-off, retaining up to 95.4% mIoU with only 12.5% of the visual tokens and delivering up to a 5.8x prefill speedup, consistently outperforming prior methods under the same budgets.
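
The abstract says per-frame budgets balance query relevance against inter-frame variation so that no segment is over-pruned, but it does not give the formula. The sketch below is one plausible reading, not the paper's actual allocation rule: the function name `allocate_frame_budgets`, the mixing weight `alpha`, the per-frame floor `min_per_frame`, and the largest-remainder rounding are all assumptions made for illustration.

```python
from typing import List


def allocate_frame_budgets(
    relevance: List[float],
    variation: List[float],
    total_budget: int,
    alpha: float = 0.5,        # assumed mixing weight, not from the paper
    min_per_frame: int = 1,    # assumed floor so no frame is fully pruned
) -> List[int]:
    """Split total_budget tokens across frames (illustrative sketch only)."""
    n = len(relevance)
    if n == 0 or n != len(variation):
        raise ValueError("relevance and variation must be non-empty and equal length")
    if total_budget < n * min_per_frame:
        raise ValueError("total_budget too small for the per-frame floor")

    def normalize(xs: List[float]) -> List[float]:
        # Scale non-negative scores to sum to 1; fall back to uniform.
        s = sum(xs)
        return [x / s for x in xs] if s > 0 else [1.0 / len(xs)] * len(xs)

    rel = normalize(relevance)
    var = normalize(variation)
    # Convex combination of the two normalized signals (sums to 1).
    score = [alpha * r + (1.0 - alpha) * v for r, v in zip(rel, var)]

    # Distribute what is left after every frame receives its floor.
    spare = total_budget - n * min_per_frame
    raw = [s * spare for s in score]
    budgets = [min_per_frame + int(x) for x in raw]

    # Largest-remainder rounding so the budgets sum exactly to total_budget.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    # Three frames: the middle one is both query-relevant and changing.
    print(allocate_frame_budgets([0.1, 0.8, 0.1], [0.0, 0.5, 0.5], total_budget=16))
```

The floor reflects the abstract's stated goal of avoiding over-pruned segments; a real implementation would also need to cap each budget at the number of tokens the frame actually has.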

Paper Structure (38 sections, 16 equations, 7 figures, 10 tables, 1 algorithm)

Figures (7)

  • Figure 1: Comparison between existing pruning objectives and SemVID for VTG. (a) Performance comparison between VTG and VideoQA tasks. (b) Diagnostics of pruning objectives on evidence retention and cross-frame connectivity. (c) VTG requires long-range evidence aggregation rather than a single informative frame. SemVID preserves both query-critical evidence and transition relays to connect evidence across frames.
  • Figure 2: Overview of SemVID semantic-oriented pruning. (a) Frame-level budget allocation: assigns per-frame token budgets by jointly considering query-frame relevance and inter-frame variation. Given the per-frame budgets, SemVID then selects tokens in three roles. (b) Object token: uses Maximal Marginal Relevance (MMR) to retain query-relevant yet diverse evidence. (c) Motion token: retains query-aligned transitions as relay nodes to bridge long-range evidence and preserve connectivity. (d) Context token: selects per-frame anchors by scene-level representativeness and saliency.
  • Figure 3: mIoUs on Charades-STA under different token retention ratios.
  • Figure 4: Position-ID trajectories before and after pruning. Pruning alters the trajectory relative to the original.
  • Figure 5: Visualization results. Blue boxes denote object tokens, red boxes indicate motion tokens, and yellow boxes represent context tokens.
  • ...and 2 more figures
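
The Figure 2(b) caption names Maximal Marginal Relevance (MMR) as the way object tokens are kept query-relevant yet diverse. As a rough illustration of generic MMR, and not SemVID's exact scoring, the sketch below greedily picks tokens that are similar to the query but dissimilar to tokens already chosen. The helper names `cosine` and `mmr_select`, the cosine similarity measure, and the trade-off weight `lam` are assumptions introduced here.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mmr_select(
    tokens: List[Sequence[float]],
    query: Sequence[float],
    k: int,
    lam: float = 0.7,  # assumed relevance/diversity trade-off weight
) -> List[int]:
    """Return indices of up to k tokens chosen greedily by generic MMR."""
    relevance = [cosine(t, query) for t in tokens]
    candidates = list(range(len(tokens)))
    selected: List[int] = []

    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            # Redundancy is the highest similarity to any token already kept.
            redundancy = max(
                (cosine(tokens[i], tokens[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    toy_tokens = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8], [0.0, 1.0]]
    toy_query = [1.0, 0.2]
    print(mmr_select(toy_tokens, toy_query, k=2, lam=0.5))
```

With `lam=1.0` the selection reduces to plain top-k by query similarity; lowering `lam` penalizes near-duplicate patches, which is the diversity effect the caption attributes to MMR.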