VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding
Jianxiang He, Meisheng Hong, Jungang Li, Ziyang Chen, Weiyu Guo, Xuming Hu, Hui Xiong
TL;DR
Visual-Subtitle Integration (VSI) addresses long-video understanding under context and computation limits with a dual-branch keyframe retrieval framework that fuses visual search and subtitle matching. The method uses adaptive, iterative frame sampling guided by both object-centric visual cues and textual subtitle similarity, with a spline-based score distribution update and a sigmoid-normalized frame selection process. VSI achieves state-of-the-art keyframe retrieval accuracy on LongVideoBench and Video-MME, and it substantially improves text-related tasks and downstream VideoQA without additional training. The result is a practical, plug-and-play component for multimodal long-video understanding that combines efficient sampling with cross-modal interaction.
Abstract
Multimodal large language models (MLLMs) demonstrate exceptional performance in vision-language tasks, yet their processing of long videos is constrained by input context length and high computational costs. Sparse frame sampling thus becomes a necessary preprocessing step, and the quality of the sampled frames directly impacts downstream performance. Existing keyframe search algorithms balance efficiency against sampled frame quality but rely heavily on the visual modality alone. This makes them difficult to adapt to text-related tasks and often causes retrieval results to deviate from the core semantic content. To address this, we propose Visual-Subtitle Integration (VSI), a multimodal keyframe retrieval framework. It employs a dual-branch collaborative retrieval approach that combines Video Search and Subtitle Match, fusing complementary visual and textual information for precise localization. Experiments on LongVideoBench and Video-MME demonstrate that VSI achieves state-of-the-art accuracy in keyframe retrieval while delivering substantial gains on text-related tasks and generalizing well to other tasks.
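To make the dual-branch idea concrete, the following is a minimal sketch of how per-frame scores from a visual branch and a subtitle branch might be fused and turned into a keyframe selection. It is not the authors' implementation: the function name `fuse_and_select`, the linear fusion weight `alpha`, the standardization step, and the `temperature` parameter are all illustrative assumptions, and the iterative sampling and spline-based score update mentioned in the summary are omitted here.

```python
import numpy as np


def fuse_and_select(visual_scores, subtitle_scores, k=8, alpha=0.5, temperature=1.0):
    """Fuse two per-frame score arrays and pick the k highest-scoring frames.

    All names and parameters are hypothetical; this only illustrates the
    general pattern of score fusion followed by sigmoid normalization.
    """
    v = np.asarray(visual_scores, dtype=float)
    s = np.asarray(subtitle_scores, dtype=float)
    if v.shape != s.shape or v.ndim != 1:
        raise ValueError("score arrays must be 1-D and have the same length")

    # Assumed fusion rule: a convex combination of the two branches.
    fused = alpha * v + (1.0 - alpha) * s

    # Standardize, then squash into (0, 1) with a sigmoid.
    z = (fused - fused.mean()) / (fused.std() + 1e-8)
    probs = 1.0 / (1.0 + np.exp(-z / temperature))

    # Take the k highest-scoring frames and return them in temporal order.
    k = min(k, probs.size)
    top = np.argsort(-probs, kind="stable")[:k]
    return np.sort(top), probs


# Toy example: frame 1 is visually relevant, frame 3 matches the subtitles.
visual = [0.1, 0.9, 0.2, 0.1, 0.3]
subtitle = [0.0, 0.1, 0.0, 0.8, 0.1]
frames, probs = fuse_and_select(visual, subtitle, k=2)
print(frames)
```

In this toy input, a purely visual ranking would prefer frames 1 and 4, while the fused ranking also surfaces frame 3, whose relevance comes only from the subtitle branch. This is the kind of complementary evidence the abstract describes.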
