Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding

Weiyu Guo, Ziyang Chen, Shaoguang Wang, Jianxiang He, Yijie Xu, Jinhui Ye, Ying Sun, Hui Xiong

TL;DR

This work addresses long-video understanding under strict frame-budget constraints by introducing Visual Semantic-Logical Search (VSLS), a framework that identifies semantically critical keyframes through four logical relations: spatial, temporal, attribute, and causal. VSLS extracts query-driven visual elements, adaptively samples frames, and iteratively updates a sampling distribution via multi-relational reasoning to achieve high QA performance with only 1.4% of frames processed on average. It demonstrates state-of-the-art keyframe retrieval metrics and significant QA gains on LongVideoBench and Video-MME, including an 8.7 percentage-point improvement in GPT-4o long-video QA accuracy. The method is training-free, plug-and-play, and demonstrates favorable efficiency-accuracy trade-offs compared with baselines, making it practical for real-world long-video understanding tasks.

Abstract

Understanding long video content is a complex endeavor that often relies on densely sampled frame captions or end-to-end feature selectors, yet these techniques commonly overlook the logical relationships between textual queries and visual elements. In practice, computational constraints necessitate coarse frame subsampling, a challenge analogous to "finding a needle in a haystack." To address this issue, we introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search. Specifically, we systematically define four fundamental logical dependencies: 1) spatial co-occurrence, 2) temporal proximity, 3) attribute dependency, and 4) causal order. These relations dynamically update frame sampling distributions through an iterative refinement process, enabling context-aware identification of semantically critical frames tailored to specific query requirements. Our method establishes new SOTA performance on the manually annotated benchmark in key-frame selection metrics. Furthermore, when applied to downstream video question-answering tasks, the proposed approach demonstrates the best performance gains over existing methods on LongVideoBench and Video-MME, validating its effectiveness in bridging the logical gap between textual queries and visual-temporal reasoning. The code will be publicly available.
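The four relations named above can be illustrated with a small sketch. The abstract does not give formal definitions, so the predicates below are only one plausible reading, and every name in the code (`Detections`, `boxes_overlap`, the four functions, the `window` parameter) is hypothetical rather than taken from the paper. The input is assumed to be a mapping from frame index to the labelled bounding boxes an object detector found in that frame.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), assumed format
Detections = Dict[int, Dict[str, Box]]    # frame index -> {label: box}


def boxes_overlap(p: Box, q: Box) -> bool:
    """True when two axis-aligned boxes share a non-empty area."""
    return p[0] < q[2] and q[0] < p[2] and p[1] < q[3] and q[1] < p[3]


def frames_with(dets: Detections, label: str) -> List[int]:
    """Sorted frame indices in which `label` was detected."""
    return sorted(t for t, objs in dets.items() if label in objs)


def spatial_cooccurrence(dets: Detections, a: str, b: str) -> List[int]:
    """Frames where both objects are detected together."""
    return sorted(t for t, objs in dets.items() if a in objs and b in objs)


def temporal_proximity(dets: Detections, a: str, b: str,
                       window: int) -> List[Tuple[int, int]]:
    """Pairs (frame of a, frame of b) at most `window` frames apart."""
    return sorted((ta, tb)
                  for ta in frames_with(dets, a)
                  for tb in frames_with(dets, b)
                  if abs(ta - tb) <= window)


def attribute_dependency(dets: Detections, obj: str, attr: str) -> List[int]:
    """Frames where the attribute's box overlaps the object's box."""
    return sorted(t for t, objs in dets.items()
                  if obj in objs and attr in objs
                  and boxes_overlap(objs[obj], objs[attr]))


def causal_order(dets: Detections, cause: str,
                 effect: str) -> List[Tuple[int, int]]:
    """Pairs where the cause is seen strictly before the effect."""
    return sorted((tc, te)
                  for tc in frames_with(dets, cause)
                  for te in frames_with(dets, effect)
                  if tc < te)


# Toy detections, invented for illustration only.
dets: Detections = {
    10: {"man": (0, 0, 50, 100), "white shirt": (10, 20, 40, 60)},
    12: {"basketball": (60, 60, 80, 80)},
    40: {"man": (0, 0, 50, 100), "basketball": (200, 200, 220, 220)},
}

spatial = spatial_cooccurrence(dets, "man", "basketball")
temporal = temporal_proximity(dets, "man", "basketball", window=5)
attribute = attribute_dependency(dets, "man", "white shirt")
causal = causal_order(dets, "man", "basketball")
```

In this reading, spatial and attribute checks are satisfied by single frames, while temporal and causal checks are satisfied by pairs of frames, which is why the latter two return tuples.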

Paper Structure

This paper contains 65 sections, 18 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Examples of four types of visual semantic-logical relationships in video QA detected by our VSLS framework: Temporal(text, time, pen), Attribute(man, attribute, white shirt), Spatial(copilot, spatial, Egyptian Pyramids), and Causal(man, causal, basketball). Green boxes indicate correct answers, while red boxes show baseline errors.
  • Figure 2: Our VSLS Framework for Efficient Keyframe Selection. VSLS sparsely samples frames and selects key ones via object detection and logic verification. Steps: 1) Use an LLM and a VLM to extract cue/target objects and four logic types (spatial, temporal, attribute, causal); 2) Adaptive sampling with evolving confidence; 3) Detect objects via YOLO-World; 4) Fuse scores with a spline function to identify high-confidence frames for downstream tasks.
  • Figure 3: Sample weight evolution under VSLS optimization for keyframe selection. Top: 16 iterations show progressive convergence toward Ground Truth (red). Bottom: 15 iterations demonstrate similar alignment. Yellow highlights indicate precise matches between algorithm outputs (green) and manual annotations.
  • Figure 4: Average occurrences of detected semantic-logical relation types per question on the Video-MME and LongVideoBench datasets. Spatial relations are the most frequently identified, and every query in both datasets triggered at least one of the four relation types.
  • Figure 5: Performance improvement with increasing search frames. VSLS consistently enhances accuracy and reaches near-human oracle performance at 64 frames.
  • ...and 2 more figures
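
The sampling loop outlined in the Figure 2 caption (sample sparsely, score the sampled frames, raise the sampling weight around confident frames, repeat) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `search_keyframes`, `score_frame`, and all parameters are hypothetical; `score_frame` stands in for YOLO-World detection plus relation verification; and a simple triangular kernel stands in for the spline-based score fusion mentioned in the caption.

```python
import random
from typing import Callable, List


def search_keyframes(num_frames: int,
                     score_frame: Callable[[int], float],
                     budget: int,
                     per_iter: int = 8,
                     top_k: int = 4,
                     spread: int = 2,
                     seed: int = 0) -> List[int]:
    """Iteratively sample frames and return the top_k highest-scoring ones.

    score_frame(t) is a hypothetical callback returning a non-negative
    confidence for frame t (assumed: detector score plus relation bonus).
    """
    rng = random.Random(seed)
    weights = [1.0] * num_frames   # start from a uniform sampling weight
    scores = [0.0] * num_frames
    visited: List[int] = []
    budget = min(budget, num_frames)

    while len(visited) < budget:
        candidates = [t for t in range(num_frames) if scores[t] == 0.0
                      and t not in visited]
        if not candidates:
            candidates = [t for t in range(num_frames) if t not in visited]
        k = min(per_iter, budget - len(visited), len(candidates))

        # Weighted sampling without replacement from unvisited frames.
        batch = []
        for _ in range(k):
            w = [weights[t] for t in candidates]
            pick = rng.choices(candidates, weights=w)[0]
            candidates.remove(pick)
            batch.append(pick)

        for t in batch:
            visited.append(t)
            s = score_frame(t)
            scores[t] = s
            # Spread confidence to neighbours with a triangular kernel
            # (an assumed stand-in for the spline used in the paper).
            for d in range(-spread, spread + 1):
                n = t + d
                if 0 <= n < num_frames:
                    weights[n] += s * (1.0 - abs(d) / (spread + 1))

    best = sorted(visited, key=lambda t: scores[t], reverse=True)[:top_k]
    return sorted(best)


# Toy scorer, invented for illustration: frames 20..22 are "relevant".
def toy_score(t: int) -> float:
    return 1.0 if 20 <= t <= 22 else 0.0


full = search_keyframes(50, toy_score, budget=50, top_k=3)
sparse = search_keyframes(50, toy_score, budget=16, top_k=4)
```

With `budget` equal to the number of frames the loop degenerates to exhaustive scoring; the interesting regime is a small budget, where frames near an early confident hit are drawn more often in later iterations.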