
Shot-Aware Frame Sampling for Video Understanding

Mengyu Zhao, Di Fu, Yongyu Xie, Jiaxing Zhang, Zhigang Yuan, Shirin Jalali, Yong Cao

Abstract

Video frame sampling is essential for efficient long-video understanding with Vision-Language Models (VLMs), since dense inputs are costly and often exceed context limits. Yet when only a small number of frames can be retained, existing samplers often fail to balance broad video coverage with brief but critical events, which can lead to unreliable downstream predictions. To address this issue, we present InfoShot, a task-agnostic, shot-aware frame sampler for long-video understanding. InfoShot first partitions a video into semantically consistent shots, and then selects two complementary keyframes from each shot: one to represent the main content and one to capture unusual within-shot changes. This design is guided by an information-theoretic objective that encourages the sampled set to retain high information about both shot structure and sparse within-shot deviations. In this way, it improves the chance of preserving both overall video context and short decision-critical moments without requiring any retraining. To better evaluate such short-lived events, we further introduce SynFlash, a synthetic benchmark with controllable sub-second anomaly patterns and frame-level ground truth, and we also evaluate InfoShot on existing anomaly datasets and general video understanding tasks. Experiments show that InfoShot improves anomaly hit rate and downstream Video-QA accuracy under frame number constraints, while matching or outperforming strong baselines on standard video understanding benchmarks.
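
The two-stage procedure described in the abstract (partition into shots, then keep two complementary frames per shot) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cosine-to-running-centroid split rule, the `threshold` value, and the choice of "least similar to the shot centroid" as the deviation frame are all assumptions made for illustration, and the frame-budget handling and information-theoretic objective are omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def segment_shots(feats, threshold=0.8):
    """Greedy pass (assumed rule): open a new shot when a frame's similarity
    to the running centroid of the current shot drops below `threshold`."""
    shots, current = [], [0]
    for i in range(1, len(feats)):
        centroid = mean_vec([feats[j] for j in current])
        if cosine(feats[i], centroid) < threshold:
            shots.append(current)
            current = [i]
        else:
            current.append(i)
    shots.append(current)
    return shots


def select_two_per_shot(feats, shots):
    """Per shot, keep the frame closest to the centroid (common content) and
    the frame farthest from it (assumed proxy for a within-shot deviation)."""
    picked = []
    for shot in shots:
        centroid = mean_vec([feats[j] for j in shot])
        sim = {j: cosine(feats[j], centroid) for j in shot}
        common = max(shot, key=lambda j: sim[j])
        unique = min(shot, key=lambda j: sim[j])
        picked.extend(sorted({common, unique}))  # one frame if the shot has length 1
    return picked


# Toy per-frame features: frame 2 deviates inside the first shot,
# and the content changes entirely at frame 4.
feats = [[1, 0], [1, 0], [0.8, 0.6], [1, 0], [0, 1], [0, 1], [0.1, 1]]
shots = segment_shots(feats, threshold=0.7)
keyframes = select_two_per_shot(feats, shots)
```

In this toy input the deviating frame is retained alongside one representative frame per shot, which is the behavior the abstract motivates; a uniform sampler with the same four-frame budget has no such guarantee.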

Paper Structure (24 sections, 14 equations, 10 figures, 4 tables, 1 algorithm)


Figures (10)

  • Figure 1: The overall framework of our approach. We place a task-agnostic, plug-and-play sampler, InfoShot, before the VLM to select keyframes under a fixed budget. InfoShot performs greedy shot segmentation in deep feature space (Sec. \ref{subsec:shot_seg}) and then selects two frames per shot: a common frame and a unique high-deviation frame (Sec. \ref{subsec:dualframe}). The downstream VLM uses the same prompt; only the sampled frame set differs.
  • Figure 2: Overview of the SynFlash generation pipeline. ImageNet-1k \cite{imagenet15russakovsky} images are injected at random times into videos formed by concatenating three Panda-70M \cite{chen2024panda70m} clips, creating four transient anomaly types with frame-level ground truth.
  • Figure 3: Case study from SynFlash (PiP). Left: sampled frames from different methods. Right: example questions.
  • Figure 4: Feature-space distortion $\mathrm{Dist}(\mathcal{K})$ (lower is better) on SynFlash.
  • Figure 5: Similarity matrices on SynFlash under different feature extractors. The first row uses HSV color-histogram features and the second row uses semantic features. Each column shows one example video from a SynFlash subset (Camouflage, High-Sat, Curtain, and PiP). Red arrows indicate the temporal location of the injected flash event.
  • ...and 5 more figures
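
As a rough illustration of the frame-to-frame similarity matrices mentioned in the Figure 5 caption, the sketch below builds a color-histogram feature per frame and compares all frame pairs by cosine similarity. It is a simplification under stated assumptions: only the hue channel is binned (the caption refers to full HSV histograms), the bin count is arbitrary, and `hue_histogram` and `similarity_matrix` are hypothetical helper names, not functions from the paper.

```python
import colorsys
import math


def hue_histogram(pixels, bins=8):
    """Normalized hue histogram of one frame.
    `pixels` is an iterable of (r, g, b) tuples with values in [0, 1]."""
    hist = [0.0] * bins
    for r, g, b in pixels:
        h, _s, _v = colorsys.rgb_to_hsv(r, g, b)
        hist[min(int(h * bins), bins - 1)] += 1.0
    total = sum(hist)
    return [c / total for c in hist] if total else hist


def similarity_matrix(feats):
    """Pairwise cosine similarity between per-frame feature vectors."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    return [[cos(a, b) for b in feats] for a in feats]


# Toy video: three red frames with a single blue "flash" at index 2.
red = [(1.0, 0.0, 0.0)] * 2
blue = [(0.0, 0.0, 1.0)] * 2
frames = [red, red, blue, red]
S = similarity_matrix([hue_histogram(f) for f in frames])
```

In such a matrix a transient event shows up as a thin low-similarity row and column at the flash index (here, row 2), while frames from the same scene form a high-similarity block.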