Large Model based Sequential Keyframe Extraction for Video Summarization
Kailong Tan, Yuxiang Zhou, Qianchen Xia, Rui Liu, Yong Chen
TL;DR
This work tackles concise video summarization by extracting sequential keyframes that preserve semantics and temporal order. It introduces LMSKE, a three-stage pipeline that leverages large models for precise shot segmentation (TransNetV2) and frame-level semantic embeddings (CLIP), followed by adaptive clustering with $k_{max}=\sqrt{n}$ and silhouette optimization, and final redundancy elimination using color-histogram similarity. The approach achieves superior average metrics on the TVSum20 benchmark (F1 = 0.5311, Fidelity = 0.8141, CR = 0.9922), outperforming several state-of-the-art methods. A public TVSum20 dataset and a scalable, model-driven framework offer practical impact for efficient video indexing and retrieval with minimal but informative keyframes.
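The adaptive clustering step can be illustrated with a minimal sketch. It assumes each frame in a shot is already represented by a feature vector (for example a CLIP embedding), and it scans cluster counts from 2 up to $k_{max}=\sqrt{n}$ with plain k-means, keeping the partition with the highest mean silhouette score. The helper names (`kmeans`, `silhouette`, `candidate_keyframes`) and the simple scan over k are illustrative assumptions, not the authors' implementation, whose exact procedure may differ.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in); returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters contribute 0."""
    clusters = {}
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(i)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(math.dist(p, points[j]) for j in idx) / len(idx)
                for l, idx in clusters.items() if l != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def candidate_keyframes(features):
    """Return sorted frame indices nearest to the centers of the best partition."""
    n = len(features)
    if n < 3:
        return [0] if n else []
    k_max = max(2, int(math.sqrt(n)))  # k_max = sqrt(n), as stated in the text
    best = None
    for k in range(2, min(k_max, n - 1) + 1):
        labels, centers = kmeans(features, k)
        score = silhouette(features, labels)
        if best is None or score > best[0]:
            best = (score, labels, centers)
    _, labels, centers = best
    picks = set()
    for c, center in enumerate(centers):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            picks.add(min(members, key=lambda i: sq_dist(features[i], center)))
    return sorted(picks)
```

Called on the per-frame features of one shot, `candidate_keyframes` returns frame indices in temporal order, so candidates from successive shots can be concatenated directly.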
Abstract
Keyframe extraction aims to summarize a video's semantics with the minimum number of its frames. This paper proposes Large Model based Sequential Keyframe Extraction for video summarization, dubbed LMSKE, which consists of three stages. First, we use the large model "TransNetV2" to cut the video into consecutive shots, and employ the large model "CLIP" to generate a visual feature for each frame within each shot. Second, we develop an adaptive clustering algorithm to yield candidate keyframes for each shot, with each candidate keyframe being the frame located nearest to a cluster center. Third, we further reduce the candidate keyframes via redundancy elimination within each shot, and finally concatenate them in shot order to form the final sequential keyframes. To evaluate LMSKE, we curate a benchmark dataset and conduct extensive experiments, whose results show that LMSKE outperforms several SOTA competitors, with an average F1 of 0.5311, an average fidelity of 0.8141, and an average compression ratio of 0.9922.
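The third stage, redundancy elimination within a shot, might look roughly like the sketch below. It assumes similarity is measured by intersection of normalized joint RGB histograms and that a candidate is dropped when it is too similar to an already kept frame. The function names, the bin count, and the 0.9 threshold are hypothetical choices for illustration and are not taken from the paper.

```python
def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram; pixels are (r, g, b) tuples in 0..255."""
    hist = [0.0] * (bins ** 3)
    count = 0
    for r, g, b in pixels:
        # Map each channel value to a bin index in 0..bins-1.
        idx = ((r * bins // 256) * bins + (g * bins // 256)) * bins + (b * bins // 256)
        hist[idx] += 1.0
        count += 1
    return [h / count for h in hist] if count else hist


def histogram_similarity(h1, h2):
    """Histogram intersection; lies in [0, 1] for normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def eliminate_redundancy(candidates, threshold=0.9):
    """Filter (frame_index, pixels) pairs given in temporal order.

    A candidate is kept only if its histogram similarity to every frame
    kept so far is below the (assumed) threshold.
    """
    kept = []
    for index, pixels in candidates:
        hist = color_histogram(pixels)
        if all(histogram_similarity(hist, kh) < threshold for _, kh in kept):
            kept.append((index, hist))
    return [index for index, _ in kept]
```

Because candidates are processed in temporal order, the surviving indices stay sorted, and concatenating the per-shot results in shot order yields the final keyframe sequence.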
