Scalable Frame Sampling for Video Classification: A Semi-Optimal Policy Approach with Reduced Search Space

Junho Lee; Jeongwoo Shin; Seung Woo Ko; Seongsu Ha; Joonseok Lee

TL;DR

The paper tackles scalable frame sampling for video classification by introducing a semi-optimal policy that reduces the search space from $O(T^N)$ to $O(T)$ under a frame-independence assumption that is observed to hold at practical frame rates. It presents SOSampler, which learns this policy by distilling per-frame classifier confidence through a pairwise ranking loss and a label-guidance loss, enabling effective selection of $N$ frames from $T$ candidate frames. Extensive experiments on ActivityNet-v1.3, Mini-Kinetics, Mini-Sports1M, and COIN with CNN and Transformer backbones show that the semi-optimal approach yields stable, high performance for both small and large values of $N$ and $T$, often outperforming methods that search the full combinatorial space. The proposed method is also more computationally efficient, achieving higher throughput with lower GFLOPs. Overall, the work shifts the focus from exploring large search spaces to exploiting principled, independence-based scoring to achieve scalable, accurate video classification.
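To make the idea of distilling per-frame confidence through a pairwise ranking loss more concrete, here is a minimal sketch in Python. It is a generic margin-based hinge ranking loss, not necessarily the formulation used by SOSampler; the function name `pairwise_ranking_loss`, the `margin` parameter, and the example values are assumptions introduced only for illustration.

```python
def pairwise_ranking_loss(pred_scores, target_conf, margin=0.1):
    """Mean hinge loss over ordered frame pairs (hypothetical helper).

    pred_scores: scores the sampler assigns to each frame.
    target_conf: per-frame confidence of the classifier (the distillation target).
    margin: assumed hyperparameter; not taken from the paper.
    """
    if len(pred_scores) != len(target_conf):
        raise ValueError("inputs must have the same length")
    total, pairs = 0.0, 0
    for i in range(len(target_conf)):
        for j in range(len(target_conf)):
            if target_conf[i] > target_conf[j]:
                # Frame i should be scored at least `margin` above frame j.
                total += max(0.0, margin - (pred_scores[i] - pred_scores[j]))
                pairs += 1
    return total / pairs if pairs else 0.0


# Illustrative values: the first prediction preserves the target ordering,
# the second reverses it and is therefore penalized.
target = [0.8, 0.2]
loss_when_ordered = pairwise_ranking_loss([0.9, 0.1], target)
loss_when_reversed = pairwise_ranking_loss([0.1, 0.9], target)
```

A loss of this kind only asks the sampler to reproduce the ordering of frames by confidence, which is all that a top-$N$ selection needs.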

Abstract

Given a video with $T$ frames, frame sampling is the task of selecting $N \ll T$ frames so as to maximize the performance of a fixed video classifier. Most existing methods, not only brute-force search, suffer from the vast search space of $\binom{T}{N}$ candidate subsets, especially when $N$ is large. To address this challenge, we introduce a novel perspective of reducing the search space from $O(T^N)$ to $O(T)$. Instead of exploring the entire $O(T^N)$ space, our proposed semi-optimal policy selects the top $N$ frames based on the independently estimated value of each frame using per-frame confidence, significantly reducing the computational complexity. We verify that our semi-optimal policy can efficiently approximate the optimal policy, particularly under practical settings. Additionally, through extensive experiments on various datasets and model architectures, we demonstrate that learning our semi-optimal policy ensures stable and high performance regardless of the size of $N$ and $T$.
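The selection rule described in the abstract can be sketched as follows, assuming the per-frame confidences are already available as a list of floats. The helper name `select_top_n_frames`, the tie-breaking rule, and the example values are illustrative assumptions, not details from the paper.

```python
import math


def select_top_n_frames(confidences, n):
    """Return the indices of the n highest-confidence frames, in temporal order.

    Each frame is scored independently, so the rule looks at T scores
    rather than at every N-frame subset.
    """
    t = len(confidences)
    if not 0 < n <= t:
        raise ValueError("n must be between 1 and the number of frames")
    # Rank by descending confidence; ties go to the earlier frame (assumed rule).
    ranked = sorted(range(t), key=lambda i: (-confidences[i], i))
    return sorted(ranked[:n])


# Illustrative confidences for T = 6 frames; choose N = 3.
confidences = [0.10, 0.80, 0.30, 0.95, 0.60, 0.20]
selected = select_top_n_frames(confidences, n=3)

# Number of subsets an exhaustive search would face versus the number of
# per-frame scores needed here, for an example with T = 16 and N = 6.
subsets_to_search = math.comb(16, 6)
scores_to_compute = 16
```

The count of subsets grows combinatorially with $N$, while the number of per-frame scores stays equal to $T$, which is the scaling argument the abstract makes.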

Paper Structure (23 sections, 8 equations, 8 figures, 9 tables)

Introduction
Related Work
Problem Formulation
Method
Semi-Optimal Policy
SOSampler: Semi-optimal Policy-based Sampler
Experiment
Experimental Setup
Results and Analysis
Ablation Study
Ablation Study
Conclusion
Detailed Descriptions
Model Configuration
Dataset
...and 8 more sections

Figures (8)

Figure 1: Distribution of $\mathcal{I}(\mathbf{x}_i, \mathbf{x}_j)$ with kernel density estimate.
Figure 2: Illustration of the Semi-Optimal Policy $\bm{\pi_s}$. We briefly illustrate how $\pi_s$ works on two different architectures. The numbers in the pink and blue boxes represent the frame indices sampled with ResNet50 and TimeSformer as the backbone, respectively, when $N=6$.
Figure 3: SOSampler Algorithm. SOSampler consists of a sampler $S_\text{SO}$ and a classifier $f_c$, which can be any model architecture.
Figure 4: Performance by sampler across N/T on ActivityNet-v1.3.
Figure 5: Experiment on long videos for large $N$ and $T$. The best performing model is bold-faced.
...and 3 more figures
