Large Model based Sequential Keyframe Extraction for Video Summarization

Kailong Tan; Yuxiang Zhou; Qianchen Xia; Rui Liu; Yong Chen

TL;DR

This work tackles concise video summarization by extracting sequential keyframes that preserve semantics and temporal order. It introduces LMSKE, a three-stage pipeline that leverages large models for precise shot segmentation (TransNetV2) and frame-level semantic embeddings (CLIP), followed by adaptive clustering with $k_{max}=\sqrt{n}$ and silhouette optimization, and final redundancy elimination using color-histogram similarity. The approach achieves superior average metrics on the TVSum20 benchmark (F1 = 0.5311, Fidelity = 0.8141, CR = 0.9922), outperforming several state-of-the-art methods. A public TVSum20 dataset and a scalable, model-driven framework offer practical impact for efficient video indexing and retrieval with minimal but informative keyframes.
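The adaptive clustering stage can be illustrated with a small sketch. It is one plausible reading of the summary above, not the authors' implementation: for a shot with n frames it tries every k from 2 to floor(sqrt(n)), keeps the clustering with the highest mean silhouette score, and returns the frame nearest to each cluster center. The plain k-means loop, the farthest-first seeding, the fallback for very short shots, and all function names are assumptions made for illustration; in LMSKE the feature vectors would be CLIP embeddings rather than the toy 2-D points used here.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-first seeding (an assumption)."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(i)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, l in enumerate(labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(points[i], points[j]) for j in other) / len(other)
            for m, other in clusters.items() if m != l
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def candidate_keyframes(features):
    """Return sorted frame indices (one per cluster) for a single shot."""
    n = len(features)
    if n == 0:
        return []
    k_max = math.isqrt(n)  # k_max = floor(sqrt(n)), as in the summary
    if k_max < 2:
        return [n // 2]  # assumption: very short shots keep their middle frame
    best = None
    for k in range(2, k_max + 1):
        labels, centers = kmeans(features, k)
        score = silhouette(features, labels)
        if best is None or score > best[0]:
            best = (score, labels, centers)
    _, labels, centers = best
    picks = set()
    for c, center in enumerate(centers):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            picks.add(min(members, key=lambda i: dist(features[i], center)))
    return sorted(picks)


# Toy shot: 12 frames forming three well-separated groups of four.
features = [
    [0, 0], [1, 0], [2, 0], [1, 0.1],
    [50, 0], [51, 0], [52, 0], [51, 0.1],
    [0, 50], [1, 50], [2, 50], [1, 50.1],
]
print(candidate_keyframes(features))
```

Returning the picks in sorted order keeps the temporal order of frames within the shot, which matters because the final summary is meant to be sequential.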

Abstract

Keyframe extraction aims to summarize a video's semantics with the minimum number of its frames. This paper puts forward Large Model based Sequential Keyframe Extraction for video summarization, dubbed LMSKE, which contains three stages. First, we use the large model "TransNetV2" to cut the video into consecutive shots, and employ the large model "CLIP" to generate each frame's visual feature within each shot. Second, we develop an adaptive clustering algorithm to yield candidate keyframes for each shot, with each candidate keyframe located nearest to a cluster center. Third, we further reduce the candidate keyframes via redundancy elimination within each shot, and finally concatenate them in shot order to form the final sequential keyframes. To evaluate LMSKE, we curate a benchmark dataset and conduct extensive experiments, whose results show that LMSKE outperforms several SOTA competitors, with an average F1 of 0.5311, an average fidelity of 0.8141, and an average compression ratio of 0.9922.
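The third stage (redundancy elimination within each shot, then concatenation in shot order) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the TL;DR says redundancy is judged by color-histogram similarity, but the joint RGB histogram with 4 bins per channel, the histogram-intersection measure, the 0.8 threshold, and every function name here are hypothetical choices. Frames are represented as plain lists of (r, g, b) tuples to keep the example dependency-free.

```python
def color_histogram(pixels, bins=4):
    """Normalised joint RGB histogram of (r, g, b) tuples with values in 0..255."""
    hist = [0.0] * (bins ** 3)
    count = 0
    for r, g, b in pixels:
        # Map each channel value to a bin index in 0..bins-1.
        idx = (r * bins // 256) * bins * bins + (g * bins // 256) * bins + (b * bins // 256)
        hist[idx] += 1
        count += 1
    return [h / count for h in hist] if count else hist


def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def eliminate_redundancy(candidates, threshold=0.8):
    """Drop candidates too similar to an already kept frame of the same shot.

    candidates: list of (frame_index, pixels) in temporal order for one shot.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    kept = []  # (frame_index, histogram) pairs
    for idx, pixels in candidates:
        hist = color_histogram(pixels)
        if all(intersection(hist, h) < threshold for _, h in kept):
            kept.append((idx, hist))
    return [idx for idx, _ in kept]


def sequential_keyframes(shots, threshold=0.8):
    """Concatenate the surviving keyframes of each shot in shot order."""
    result = []
    for shot in shots:
        result.extend(eliminate_redundancy(shot, threshold))
    return result


# Toy frames: 16 pixels each.
red = [(255, 0, 0)] * 16
near_red = [(250, 5, 5)] * 16
blue = [(0, 0, 255)] * 16

shots = [
    [(3, red), (7, near_red), (12, blue)],  # shot 1: frame 7 duplicates frame 3
    [(40, red)],                            # shot 2: compared only within its own shot
]
print(sequential_keyframes(shots))
```

Because comparisons are restricted to one shot at a time, a frame that resembles a keyframe from an earlier shot is still kept, which matches the abstract's description of elimination "within each shot".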

Paper Structure (12 sections, 5 equations, 3 figures, 1 table)

Introduction
Method
Shot Segmentation and Feature Extraction
Adaptive Clustering
Redundancy Elimination
Experiment
Dataset
Metrics and Competitors
Results
Case Study
Conclusion
Acknowledgement

Figures (3)

Figure 1: Our LMSKE framework: shot segmentation, adaptive clustering, and redundancy elimination.
Figure 2: The flow chart of the adaptive clustering algorithm.
Figure 3: Qualitative comparisons between the benchmark and the representative methods (LMSKE, INCEPTION, LBP-SHOT, UID, VSUMM, and GMC). Note that Uniform and K-Means (which appear in Table 1) are not illustrated here because they select relatively large numbers of keyframes and clearly perform worse than the other competitors.
