Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding

Xinkui Zhao, Zuxin Wang, Yifan Zhang, Guanjie Cheng, Yueshen Xu, Shuiguang Deng, Chang Liu, Naibo Wang, Jianwei Yin

TL;DR

Video-QTR reframes long-video understanding as a query-driven temporal reasoning problem. It couples a Reason–Temporal Proxy (RTP) with a lightweight Perception Module, a Temporal Consistency Refiner (TCR), and a Temporal Memory (TM) to enable selective perception and iterative reasoning. The method achieves state-of-the-art results across short- and long-video QA benchmarks while substantially reducing the number of frames passed to perception, which improves scalability for long-horizon video understanding. Ablation and qualitative analyses confirm the critical roles of RTP, TM, and TCR, and show robust performance across diverse temporal contexts. This work offers a practical, efficient path toward real-world video understanding and lays a foundation for hierarchical and interactive reasoning systems.

Abstract

The rapid development of multimodal large language models (MLLMs) has significantly expanded the scope of vision-language reasoning, enabling unified systems to interpret and describe complex visual content. However, applying these models to long-video understanding remains computationally intensive. Dense frame encoding generates excessive visual tokens, leading to high memory consumption, redundant computation, and limited scalability in real-world applications. This inefficiency highlights a key limitation of the traditional process-then-reason paradigm, which analyzes visual streams exhaustively before any semantic reasoning takes place. To address this challenge, we introduce Video-QTR (Query-Driven Temporal Reasoning), a lightweight framework that redefines video comprehension as a query-guided reasoning process. Instead of encoding every frame, Video-QTR dynamically allocates perceptual resources based on the semantic intent of the query, creating an adaptive feedback loop between reasoning and perception. Extensive experiments across five benchmarks, including MSVD-QA, ActivityNet-QA, MovieChat, and VideoMME, demonstrate that Video-QTR achieves state-of-the-art performance while reducing input frame consumption by up to 73%. These results confirm that query-driven temporal reasoning provides an efficient and scalable solution for video understanding.
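
The feedback loop between reasoning and perception described in the abstract can be illustrated with a toy coarse-to-fine search. This is only a sketch under strong assumptions, not the paper's implementation: frames are stand-in text labels, "perception" is a substring match, and the function name `query_driven_search` and its parameters `stride` and `max_rounds` are hypothetical. It shows how looking only near earlier matches keeps the number of processed frames well below dense encoding.

```python
def query_driven_search(frames, keyword, stride=8, max_rounds=3):
    """Toy coarse-to-fine search (hypothetical helper, not from the paper).

    Samples frames sparsely, then re-samples only around frames whose
    content matched the query in the previous round.
    """
    seen = set()      # indices of frames passed to the toy perception step
    evidence = set()  # indices whose content matched the query
    candidates = list(range(0, len(frames), stride))
    for _ in range(max_rounds):
        hits = []
        for i in candidates:
            seen.add(i)                 # "perceive" frame i
            if keyword in frames[i]:    # stand-in for a real relevance check
                hits.append(i)
        evidence.update(hits)
        if not hits or stride == 1:
            break
        # Feedback: the next round looks only near the current hits.
        stride = max(1, stride // 2)
        candidates = sorted(
            {j for i in hits for j in (i - stride, i, i + stride)
             if 0 <= j < len(frames)}
        )
    return sorted(evidence), len(seen)


# Toy video: 64 frames, the queried event occupies frames 30..35.
frames = ["background"] * 64
for k in range(30, 36):
    frames[k] = "dog jumps"

evidence, n_seen = query_driven_search(frames, "dog")
saved = 1 - n_seen / len(frames)  # fraction of frames never processed
print(evidence, n_seen, round(saved, 4))
```

In this toy run only 12 of the 64 frames are inspected. The saving depends entirely on the made-up data and parameters, so it should not be read as evidence for the 73% figure reported in the abstract.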

Paper Structure

This paper contains 45 sections, 10 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Overall performance comparison with state-of-the-art models. We compare our method, Video-QTR, against leading video-language models across five key benchmarks, covering both short-video (MSVD-QA, ActivityNet-QA) and long-video (MovieChat, VideoMME) understanding tasks.
  • Figure 2: Framework overview. Video‑QTR performs query‑driven temporal reasoning through four cooperating modules. The Reason–Temporal Proxy (RTP) decomposes the query into temporal episodes along the video timeline. The Perception Module uses an MLLM backbone to selectively fetch visual evidence from relevant segments. The Temporal Consistency Refiner (TCR) evaluates and refines the chronological order between reasoning and observation. The Temporal Memory (TM) maintains an event graph that stores and updates semantic and temporal relations across iterations, producing temporally consistent answers for long‑video understanding.
  • Figure 3: Ablation study on the MovieChat benchmark. RTP and TCR significantly impact performance, while TM enhances long-horizon stability and contextual reasoning.
  • Figure 4: Qualitative comparison of Video-QTR on long-video understanding. Video-QTR demonstrates superior temporal reasoning and accurate detail perception, outperforming leading LLMs.
  • Figure 5: Performance comparison across video durations. Video-QTR outperforms all methods at every evaluation point, with accuracy decreasing for most models as video duration increases.
  • ...and 6 more figures
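
The Figure 2 caption describes a Temporal Memory that keeps an event graph and a Temporal Consistency Refiner that checks chronological order. One way to picture that pairing is the minimal sketch below. Everything in it is an assumption made for illustration: the class names `Event` and `EventGraph`, the timestamp fields, and the rule that an event precedes another only when it ends before the other starts are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    label: str
    start: float  # seconds from video start (assumed unit)
    end: float


@dataclass
class EventGraph:
    """Hypothetical event memory: nodes are events, edges mean 'ends before'."""
    events: dict = field(default_factory=dict)  # label -> Event
    before: set = field(default_factory=set)    # (earlier_label, later_label)

    def add(self, event):
        # Insert or update an event, then rebuild the edges that involve it.
        self.events[event.label] = event
        self.before = {(a, b) for a, b in self.before
                       if event.label not in (a, b)}
        for other in self.events.values():
            if other.label == event.label:
                continue
            if other.end <= event.start:
                self.before.add((other.label, event.label))
            elif event.end <= other.start:
                self.before.add((event.label, other.label))

    def is_consistent(self, claimed_order):
        # A claimed sequence is accepted only if every adjacent pair
        # is supported by a stored 'before' edge.
        return all((a, b) in self.before
                   for a, b in zip(claimed_order, claimed_order[1:]))


memory = EventGraph()
memory.add(Event("enter", 0, 5))
memory.add(Event("cook", 10, 40))
memory.add(Event("eat", 45, 60))

ok = memory.is_consistent(["enter", "cook", "eat"])
reversed_ok = memory.is_consistent(["eat", "cook"])

# A later iteration revises the 'cook' interval so that it overlaps 'eat'.
memory.add(Event("cook", 50, 70))
after_update = memory.is_consistent(["cook", "eat"])
```

Rebuilding the edges on every update keeps the stored order in sync with revised observations, which is the property the caption attributes to the memory across iterations. A real system would likely need richer relations, such as overlap and containment, than this single ordering edge.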