VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding

Jianxiang He, Meisheng Hong, Jungang Li, Ziyang Chen, Weiyu Guo, Xuming Hu, Hui Xiong

TL;DR

VSI tackles the challenge of long-video understanding under context and computation limits by introducing Visual-Subtitle Integration, a dual-branch keyframe retrieval framework that fuses visual search and subtitle matching. The method uses adaptive, iterative frame sampling guided by both object-centric visual cues and textual subtitle similarity, with a spline-based score distribution update and a sigmoid-normalized frame selection process. VSI demonstrates state-of-the-art keyframe retrieval on LongVideoBench and Video-MME, while delivering substantial improvements on text-related tasks and downstream VideoQA without additional training. The work offers a practical, plug-and-play solution that enhances multimodal long-video understanding with efficient sampling and robust cross-modal interaction.

Abstract

Multimodal large language models (MLLMs) demonstrate exceptional performance in vision-language tasks, yet their processing of long videos is constrained by input context length and high computational costs. Sparse frame sampling thus becomes a necessary preprocessing step, and the quality of the sampled frames directly affects downstream performance. Existing keyframe search algorithms balance efficiency against sampled-frame quality but rely heavily on the visual modality alone. This makes them difficult to adapt to text-related tasks and often causes retrieval results to deviate from the core semantic content. To address this, we propose Visual-Subtitle Integration (VSI), a multimodal keyframe retrieval framework. It employs a dual-branch collaborative retrieval approach that combines Video Search and Subtitle Match to fuse complementary visual and textual information for precise localization. Experiments on LongVideoBench and Video-MME demonstrate that VSI achieves state-of-the-art accuracy in keyframe retrieval while delivering breakthrough performance in text-related tasks and exhibiting strong generalization across other tasks.

Paper Structure

This paper contains 20 sections, 17 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: Comparison with Baseline. In both the medium and long video settings of the LongVideoBench and Video-MME datasets, VSI consistently outperforms the three baseline models (GPT-4o, LLaVA-Video-7B-Qwen2, and Qwen2.5-VL-7B-Instruct) that use a uniform sampling strategy.
  • Figure 2: Framework for Visual-Subtitle Integration. The dual-branch architecture comprises: (a) a Video Search branch that uses YOLO-World to identify query-relevant objects; (b) a Subtitle Match branch that uses contrastive embeddings to retrieve subtitle-matching segments; (c) a fusion step in which confidence scores from both modalities update the frame-wise relevance probabilities through spline interpolation. After the iterations finish, high-confidence frames are passed to the downstream QA task. The figure shows a complete real example, from keyframe search to the completion of a VideoQA task.
  • Figure 3: Case Study. (a) Sampling probability distribution of the Video Search branch at different iteration counts; (b) influence of the Subtitle Match branch on the sampling probability distribution of the Video Search branch; (c) timestamp of the subtitle with the highest similarity in the Subtitle Match branch. Together, the three panels illustrate how the textual semantic information provided by the Subtitle Match branch dynamically influences the sampling strategy of the Video Search branch.
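
The update step described in the Figure 2 caption (fuse the two branches' confidence scores, spread them over all frames, then pick high-confidence frames) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the weighted-sum fusion with weight `alpha`, the sigmoid `temperature`, and the top-k selection rule are all hypothetical, and piecewise-linear interpolation is used here as a simple stand-in for the spline interpolation the paper mentions.

```python
import math


def fuse_scores(visual, subtitle, alpha=0.5):
    """Weighted sum of the two branches' confidences (alpha is an assumed weight)."""
    return [alpha * v + (1.0 - alpha) * s for v, s in zip(visual, subtitle)]


def interpolate(sample_idx, sample_scores, num_frames):
    """Spread sparse scores over every frame.

    Piecewise-linear stand-in for the spline update; sample_idx must be
    sorted and strictly increasing.
    """
    dense = []
    for t in range(num_frames):
        if t <= sample_idx[0]:
            dense.append(sample_scores[0])
        elif t >= sample_idx[-1]:
            dense.append(sample_scores[-1])
        else:
            # Find the pair of sampled frames that brackets frame t.
            for j in range(len(sample_idx) - 1):
                lo, hi = sample_idx[j], sample_idx[j + 1]
                if lo <= t <= hi:
                    w = (t - lo) / (hi - lo)
                    dense.append((1 - w) * sample_scores[j] + w * sample_scores[j + 1])
                    break
    return dense


def sigmoid_normalize(scores, temperature=0.1):
    """Centre scores on their mean, squash with a sigmoid, renormalise to sum to 1."""
    mean = sum(scores) / len(scores)
    squashed = [1.0 / (1.0 + math.exp(-(s - mean) / temperature)) for s in scores]
    total = sum(squashed)
    return [x / total for x in squashed]


def top_k_frames(probs, k):
    """Return the indices of the k most probable frames, in temporal order."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:k])


# Toy example: three sampled frames out of nine, with made-up confidences.
sampled = [0, 4, 8]
fused = fuse_scores([0.1, 0.9, 0.2], [0.0, 0.7, 0.1])
probs = sigmoid_normalize(interpolate(sampled, fused, num_frames=9))
print(top_k_frames(probs, k=3))  # [3, 4, 5]
```

In an iterative scheme like the one the TL;DR describes, the resulting `probs` would presumably serve as the sampling distribution for the next round of frame sampling, so that later iterations concentrate on the regions both branches consider relevant.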