
Scalable Frame Sampling for Video Classification: A Semi-Optimal Policy Approach with Reduced Search Space

Junho Lee, Jeongwoo Shin, Seung Woo Ko, Seongsu Ha, Joonseok Lee

TL;DR

The paper tackles scalable frame sampling for video classification by introducing a semi-optimal policy that reduces the search space from $O(T^N)$ to $O(T)$ under a frame-independence assumption that is observed to hold at practical frame rates. It presents SOSampler, which learns this policy by distilling per-frame classifier confidence through a pairwise ranking loss and a label-guidance loss, enabling effective selection of $N$ frames from $T$ candidate frames. Extensive experiments on ActivityNet-v1.3, Mini-Kinetics, Mini-Sports1M, and COIN with CNN and Transformer backbones show that the semi-optimal approach yields stable, high performance for both small and large values of $N$ and $T$, often outperforming methods that search the full combinatorial space. The proposed method is also more computationally efficient, achieving higher throughput with lower GFLOPs. Overall, the work shifts the focus from exploring large search spaces to exploiting principled, independence-based frame scoring to achieve scalable, accurate video classification.
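To make the distillation idea above more concrete, here is a minimal Python sketch of a pairwise ranking objective. This summary does not give the paper's exact loss, so the hinge form, the `margin` parameter, and the function name `pairwise_ranking_loss` are all assumptions chosen for illustration; the label-guidance loss is not sketched.

```python
def pairwise_ranking_loss(pred_scores, target_conf, margin=0.1):
    """Generic hinge-style pairwise ranking loss (an assumed form,
    not necessarily the loss used by SOSampler).

    For every ordered pair (i, j) in which the target confidence of
    frame i exceeds that of frame j, the sampler is penalized unless
    it scores frame i above frame j by at least `margin`.
    """
    if len(pred_scores) != len(target_conf):
        raise ValueError("pred_scores and target_conf must have equal length")
    total, pairs = 0.0, 0
    for i in range(len(target_conf)):
        for j in range(len(target_conf)):
            if target_conf[i] > target_conf[j]:
                total += max(0.0, margin - (pred_scores[i] - pred_scores[j]))
                pairs += 1
    # Average over the pairs that carry an ordering; 0.0 if there are none.
    return total / pairs if pairs else 0.0


# Hypothetical values: the sampler ranks the two frames correctly, then wrongly.
loss_correct = pairwise_ranking_loss([1.0, 0.0], [0.9, 0.1])
loss_wrong = pairwise_ranking_loss([0.0, 1.0], [0.9, 0.1])
```

Under these assumptions the loss depends only on the relative order of the predicted scores, which matches the goal of learning which frames to rank in the top $N$ rather than regressing exact confidence values.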

Abstract

Given a video with $T$ frames, frame sampling is the task of selecting $N \ll T$ frames so as to maximize the performance of a fixed video classifier. Most existing methods, not only brute-force search, suffer from the vast search space of $\binom{T}{N}$ candidate frame subsets, especially when $N$ gets large. To address this challenge, we introduce a novel perspective of reducing the search space from $O(T^N)$ to $O(T)$. Instead of exploring the entire $O(T^N)$ space, our proposed semi-optimal policy selects the top $N$ frames based on the independently estimated value of each frame using per-frame confidence, significantly reducing the computational complexity. We verify that our semi-optimal policy can efficiently approximate the optimal policy, particularly under practical settings. Additionally, through extensive experiments on various datasets and model architectures, we demonstrate that learning our semi-optimal policy ensures stable and high performance regardless of the size of $N$ and $T$.
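As a rough illustration of the selection rule described in the abstract, the following Python sketch scores each frame independently and keeps the top $N$. The function name `select_top_n` and the example confidence values are hypothetical, and returning the indices in temporal order is a design choice of this sketch rather than something stated in the source.

```python
from math import comb


def select_top_n(confidences, n):
    """Return the indices of the n highest-scoring frames, in temporal order.

    `confidences` holds one independently estimated score per frame,
    standing in for the per-frame classifier confidence.
    """
    if not 0 < n <= len(confidences):
        raise ValueError("n must be between 1 and the number of frames")
    # Rank frame indices by score, highest first; ties keep the earlier frame.
    ranked = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    # Keep the top n and restore temporal order (an assumption of this sketch).
    return sorted(ranked[:n])


# Hypothetical per-frame confidences for a video with T = 8 frames.
scores = [0.10, 0.72, 0.35, 0.90, 0.05, 0.64, 0.88, 0.20]
selected = select_top_n(scores, n=3)

# Scoring touches each frame once (T evaluations), whereas an exhaustive
# search would have to consider every N-subset of the T frames.
num_subsets = comb(len(scores), 3)
```

Even in this toy case the gap is visible: 8 per-frame scores against 56 candidate subsets, and the subset count grows combinatorially as $N$ and $T$ increase.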

Paper Structure (23 sections, 8 equations, 8 figures, 9 tables)


Figures (8)

  • Figure 1: Distribution of $\mathcal{I}(\mathbf{x}_i, \mathbf{x}_j)$ with kernel density estimate.
  • Figure 2: Illustration of the Semi-Optimal Policy $\bm{\pi_s}$. We briefly illustrate how $\pi_s$ works on two different architectures. The numbers in the pink and blue boxes represent the frame indices sampled with ResNet50 and TimeSformer as the backbone, respectively, when $N=6$.
  • Figure 3: SOSampler Algorithm. SOSampler consists of a sampler $S_\text{SO}$ and a classifier $f_c$, which can be any model architecture.
  • Figure 4: Performance by sampler across $N/T$ on ActivityNet-v1.3.
  • Figure 5: Experiment on long videos for large $N$ and $T$. The best performing model is bold-faced.
  • ...and 3 more figures