Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding

Wang Chen, Yuhui Zeng, Yongdong Luo, Tianyu Xie, Luojun Lin, Jiayi Ji, Yan Zhang, Xiawu Zheng

TL;DR

This paper introduces Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework for long video understanding. It uses the wavelet transform to detect semantic boundaries that segment the video into coherent clips, then employs the Maximal Marginal Relevance approach within each clip to select a diverse yet relevant set of frames.

Abstract

Frame selection is crucial due to high frame redundancy and limited context windows when applying Large Vision-Language Models (LVLMs) to long videos. Current methods typically select frames with high relevance to a given query, resulting in a disjointed set of frames that disregard the narrative structure of video. In this paper, we introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts - pivotal moments of narrative change that are essential to comprehending the holistic storyline of video. However, direct detection of abrupt changes in the query-frame similarity signal is often unreliable due to high-frequency noise arising from model uncertainty and transient visual variations. To address this, we leverage the wavelet transform, which provides an ideal solution through its multi-resolution analysis in both time and frequency domains. By applying this transform, we decompose the noisy signal into multiple scales and extract a clean semantic change signal from the coarsest scale. We identify the local extrema of this signal as semantic boundaries, which segment the video into coherent clips. Building on this, WFS-SB comprises a two-stage strategy: first, adaptively allocating a frame budget to each clip based on a composite importance score; and second, within each clip, employing the Maximal Marginal Relevance approach to select a diverse yet relevant set of frames. Extensive experiments show that WFS-SB significantly boosts LVLM performance, e.g., improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench, consistently outperforming state-of-the-art methods.
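
The boundary-detection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: where the paper applies a multi-level wavelet decomposition and takes the change signal from the coarsest scale, the code below swaps in a single-scale undecimated Haar response (the difference of adjacent window means), written in plain Python. The function names, the window size `half_width`, and the threshold `min_strength` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def haar_detail(signal: Sequence[float], half_width: int) -> List[float]:
    """Undecimated Haar wavelet response at one coarse scale (assumed stand-in).

    For each position t, returns mean(signal[t:t+w]) - mean(signal[t-w:t]),
    i.e. the signal correlated with a step-shaped Haar wavelet of width 2*w.
    Positions without a full window on both sides are left at 0.0.
    """
    n, w = len(signal), half_width
    out = [0.0] * n
    for t in range(w, n - w + 1):
        left = sum(signal[t - w:t]) / w
        right = sum(signal[t:t + w]) / w
        out[t] = right - left
    return out


def find_boundaries(change: Sequence[float], min_strength: float) -> List[int]:
    """Indices where |change| is a local maximum of at least min_strength."""
    mag = [abs(c) for c in change]
    peaks = []
    for t in range(1, len(mag) - 1):
        if mag[t] >= min_strength and mag[t] > mag[t - 1] and mag[t] >= mag[t + 1]:
            peaks.append(t)
    return peaks


def to_clips(boundaries: Sequence[int], n_frames: int) -> List[Tuple[int, int]]:
    """Turn boundary indices into half-open (start, end) clips."""
    edges = [0] + list(boundaries) + [n_frames]
    return [(a, b) for a, b in zip(edges, edges[1:]) if b > a]


# Toy query-frame relevance signal: three plateaus plus alternating jitter
# that mimics high-frequency noise. The values are made up for illustration.
levels = [0.1] * 20 + [0.9] * 20 + [0.2] * 20
signal = [v + (0.05 if i % 2 == 0 else -0.05) for i, v in enumerate(levels)]

change = haar_detail(signal, half_width=4)
boundaries = find_boundaries(change, min_strength=0.3)
clips = to_clips(boundaries, len(signal))
```

On this toy signal the jitter produces a nonzero frame-to-frame difference at every step, while the coarse-scale response stays near zero inside each plateau and peaks where the relevance level changes.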

Paper Structure

This paper contains 29 sections, 12 equations, 8 figures, 12 tables, 1 algorithm.

Figures (8)

  • Figure 1: A comparison of frame selection strategies, illustrated with the query: "What is the process of applying makeup around the eyes in the video?". (Top) Current approach: Selects scattered, high-relevance frames (e.g., any frame with eyes), which fails to preserve the procedural order. (Bottom) Our approach: Employs a wavelet-based method to first segment the video into semantically coherent segments (e.g., applying eyeliner, shaping eyebrows) and then samples frames from each segment. This preserves the semantic integrity essential for process comprehension.
  • Figure 2: An overview of our proposed WFS-SB framework. The process unfolds in three main stages: (1) Wavelet-based Semantic Boundary Identification: The raw query-frame relevance signal is decomposed using a multi-level Discrete Wavelet Transform. We isolate and reconstruct the coarsest detail coefficients to generate a robust semantic change signal, whose peaks define the boundaries of coherent semantic segments. (2) Adaptive Budget Allocation: A composite importance score is computed for each segment, which guides a softmax-weighted distribution of the total frame budget $K$ = 8. (3) Diversity-Aware Intra-segment Selection: Finally, a localized Maximal Marginal Relevance (MMR) selection is performed within each segment to ensure both relevance and diversity in the final keyframe set.
  • Figure 3: Performance comparison across different frame budgets $K$ on VideoMME. WFS-SB consistently outperforms uniform sampling across all LVLMs and budget settings.
  • Figure 4: Visualization of our WFS-SB framework on a daily itinerary video. Our wavelet-based method first partitions the video into coherent segments by detecting robust semantic boundaries. Subsequently, after filtering segments based on importance scores, Maximal Marginal Relevance sampling is applied within each to select keyframes that preserve the essential narrative, correctly capturing the sequence of class, tutoring, and self-study.
  • Figure 5: Visualization of three temporal relevance signal characteristics. (a) Non-stationarity: ITM score statistics (mean, variance) change drastically over time, with high-relevance segments showing elevated mean and variance. (b) Multi-scale structure: Semantic segments span vastly different temporal scales, from rapid transitions (5-10 frames) to prolonged processes (160+ frames). (c) Low signal-to-noise ratio: The raw ITM signal (gray) is heavily corrupted by high-frequency noise (yellow envelope), making direct peak detection error-prone.
  • ...and 3 more figures
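
Stages (2) and (3) from the Figure 2 caption can be sketched in a few lines of plain Python. This is a hedged illustration, not the paper's code: the composite importance score is taken as a given input because its definition is not stated here, and the names `allocate_budget` and `mmr_select`, the `temperature` and `lam` parameters, the largest-remainder rounding, and the cosine-similarity redundancy term are all assumptions.

```python
import math
from typing import List, Sequence


def allocate_budget(scores: Sequence[float], total: int,
                    temperature: float = 1.0) -> List[int]:
    """Split `total` frames across segments in proportion to softmax(scores)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    shares = [total * e / z for e in exps]
    budget = [int(math.floor(s)) for s in shares]
    # Hand out the frames lost to rounding, largest fractional part first.
    leftover = total - sum(budget)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - budget[i], reverse=True)
    for i in order[:leftover]:
        budget[i] += 1
    return budget


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def mmr_select(relevance: Sequence[float], features: Sequence[Sequence[float]],
               k: int, lam: float = 0.7) -> List[int]:
    """Greedy Maximal Marginal Relevance over the frames of one segment.

    Each step picks the frame maximizing
    lam * relevance - (1 - lam) * (max similarity to already selected frames).
    Returns the chosen frame indices in temporal order.
    """
    selected: List[int] = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max((cosine(features[i], features[j]) for j in selected),
                             default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)


# Made-up inputs: three segments sharing a budget of K = 8 frames,
# and one three-frame segment in which frames 0 and 1 are near-duplicates.
budget = allocate_budget([2.0, 0.0, 1.0], total=8)
picked = mmr_select([0.9, 0.85, 0.5], [[1, 0], [1, 0], [0, 1]], k=2, lam=0.5)
```

In the toy call, the second pick skips the near-duplicate frame 1 in favor of the less relevant but visually distinct frame 2, which is the relevance-versus-diversity trade-off MMR is meant to provide. A fuller version would also cap each segment's budget at its frame count.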