Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
Weiyu Guo, Ziyang Chen, Shaoguang Wang, Jianxiang He, Yijie Xu, Jinhui Ye, Ying Sun, Hui Xiong
TL;DR
This work addresses long-video understanding under strict frame-budget constraints by introducing Visual Semantic-Logical Search (VSLS), a framework that identifies semantically critical keyframes using four logical relations: spatial, temporal, attribute, and causal. VSLS extracts query-driven visual elements, adaptively samples frames, and iteratively updates the sampling distribution via multi-relational reasoning, reaching high QA performance while processing only 1.4% of frames on average. It achieves state-of-the-art keyframe retrieval metrics and significant QA gains on LongVideoBench and Video-MME, including an 8.7 percentage-point improvement in GPT-4o long-video QA accuracy. The method is training-free and plug-and-play, and it offers favorable efficiency-accuracy trade-offs compared with baselines, making it practical for real-world long-video understanding tasks.
Abstract
Understanding long video content is a complex endeavor that often relies on densely sampled frame captions or end-to-end feature selectors, yet these techniques commonly overlook the logical relationships between textual queries and visual elements. In practice, computational constraints necessitate coarse frame subsampling, which makes locating the relevant frames akin to "finding a needle in a haystack." To address this issue, we introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search (VSLS). Specifically, we systematically define four fundamental logical dependencies: 1) spatial co-occurrence, 2) temporal proximity, 3) attribute dependency, and 4) causal order. These relations dynamically update the frame sampling distribution through an iterative refinement process, enabling context-aware identification of semantically critical frames tailored to specific query requirements. Our method establishes new state-of-the-art performance on keyframe selection metrics on the manually annotated benchmark. Furthermore, when applied to downstream video question-answering tasks, the proposed approach achieves the largest performance gains over existing methods on LongVideoBench and Video-MME, validating its effectiveness in bridging the logical gap between textual queries and visual-temporal reasoning. The code will be publicly available.
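
To make the iterative refinement idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical `detect(i)` callable that returns the set of object labels found in frame `i`, and it models only two of the four relations in a simplified form: spatial co-occurrence (the fraction of target labels found together in a frame) and temporal proximity (raising the sampling weight of frames near a high-scoring frame). Attribute and causal relations are omitted, and all names and parameters (`refine_sampling`, `budget`, `rounds`, `spread`, `boost`) are illustrative assumptions.

```python
import numpy as np

def refine_sampling(num_frames, detect, targets, budget=8, rounds=5,
                    spread=2, boost=1.0, seed=0):
    """Toy iterative keyframe search (illustrative only).

    detect(i) -> set of labels found in frame i (hypothetical detector).
    targets   -> set of labels that should co-occur in a single frame.
    Returns the sorted indices of up to `budget` best-scoring frames.
    """
    rng = np.random.default_rng(seed)
    weights = np.ones(num_frames)                 # unnormalized sampling weights
    scores = np.zeros(num_frames)                 # co-occurrence score per frame
    visited = np.zeros(num_frames, dtype=bool)    # frames already inspected

    for _ in range(rounds):
        candidates = np.flatnonzero(~visited)
        if candidates.size == 0:
            break
        # Sample unvisited frames in proportion to their current weight.
        p = weights[candidates] / weights[candidates].sum()
        k = min(budget, candidates.size)
        picked = rng.choice(candidates, size=k, replace=False, p=p)
        for i in picked:
            visited[i] = True
            # Spatial co-occurrence: fraction of targets present together.
            found = detect(int(i)) & targets
            scores[i] = len(found) / len(targets)
            # Temporal proximity: raise the weight of neighboring frames.
            lo, hi = max(0, i - spread), min(num_frames, i + spread + 1)
            weights[lo:hi] += boost * scores[i]

    top = np.argsort(-scores, kind="stable")[:budget]
    return sorted(int(i) for i in top if scores[i] > 0)


# Usage with a stand-in detector: both targets appear only in frames 7 and 8.
def toy_detect(i):
    return {"person", "dog"} if i in (7, 8) else set()

keyframes = refine_sampling(20, toy_detect, {"person", "dog"}, budget=5, rounds=4)
```

In this sketch the weights start uniform, so the first round is plain random sampling; later rounds concentrate on the neighborhoods of frames where the targets were found. A real system would replace `toy_detect` with an actual object detector and add scoring terms for the attribute and causal relations.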
