Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning

Chendong Wang, Donglin Bai, Yifan Yang, Xiao Jin, Anlan Zhang, Rui Wang, Shiqi Jiang, Yuqing Yang, Hao Wu, Qi Dai, Chong Luo, Ting Cao, Lili Qiu, Suman Banerjee

TL;DR

Video-in-the-Loop (ViTL) is a two-stage long-video QA framework that keeps the token budget fixed: it first localizes the question-relevant interval(s) with a low-fps skim, then reallocates visual tokens to those intervals at a higher effective frame rate. Its interleaved output contains both the spans and the final option, so each answer can be attributed directly to the spans it relied on.
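The budget arithmetic behind this skim-then-reallocate idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the budget is counted in frames rather than tokens, and the budget is split across spans in proportion to their duration.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def merge_spans(spans: List[Span]) -> List[Span]:
    """Sort spans and merge overlaps so no second of video is counted twice."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if end <= start:
            continue  # skip empty or inverted spans
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def uniform_fps(duration_sec: float, frame_budget: int) -> float:
    """Frame rate obtained when the budget is spread over the whole video."""
    return frame_budget / duration_sec


def reallocate_frames(spans: List[Span], frame_budget: int) -> List[float]:
    """Return sampling timestamps that spend the budget only inside the spans.

    Assumption: each span receives a share proportional to its duration.
    """
    merged = merge_spans(spans)
    total = sum(end - start for start, end in merged)
    if total <= 0:
        raise ValueError("need at least one non-empty span")
    timestamps: List[float] = []
    for start, end in merged:
        dur = end - start
        n = int(frame_budget * dur / total)  # floor keeps the sum within budget
        for i in range(n):
            # Sample at the centre of n equal sub-intervals of the span.
            timestamps.append(start + (i + 0.5) * dur / n)
    return timestamps
```

As a worked example with a hypothetical one-hour video and a 64-frame budget, uniform sampling yields 64/3600 ≈ 0.018 fps, while spending the same budget on a single 81-second span yields 64/81 ≈ 0.79 fps.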

Abstract

We present Video-in-the-Loop (ViTL), a two-stage long-video QA framework that preserves a fixed token budget by first localizing question-relevant interval(s) with a low-fps skim and then answering via span-aware reallocation of visual tokens at a higher effective frame rate, emitting an interleaved output with both spans and the final option for direct attribution. We also introduce VGrounding-QA, which converts description-based event graphs into span-grounded multiple-choice QA by pairing each question with ground-truth time span(s) and related reasoning. ViTL is trained end-to-end with an interleaved group-relative objective that couples temporal IoU for localization with answer correctness, allowing credit to flow from answers back to spans without increasing compute. Under fixed token budgets, ViTL attains up to 8.6% with 50% fewer input frames on long-video QA and temporal grounding (e.g., Charades-STA, ActivityNet-Captions), and ablations show that span-aware token reallocation consistently surpasses uniform sampling. Together, VGrounding-QA and ViTL provide an interpretable, compute-efficient recipe for scalable long-video QA.
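To make the coupling of temporal IoU and answer correctness concrete, the sketch below computes a combined reward for one sampled output and then normalizes rewards within a group of samples. It is only an illustration: the equal weights, the function names, and the mean/standard-deviation normalization (in the style of GRPO) are assumptions, not details confirmed by the abstract.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def _merge(spans: List[Span]) -> List[Span]:
    """Sort spans, drop empty ones, and merge overlaps."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if end <= start:
            continue
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def temporal_iou(pred: List[Span], gold: List[Span]) -> float:
    """IoU between two sets of time spans, measured in seconds."""
    p, g = _merge(pred), _merge(gold)
    # After merging, spans within each set are disjoint, so pairwise
    # overlaps can be summed without double counting.
    inter = sum(
        max(0.0, min(pe, ge) - max(ps, gs)) for ps, pe in p for gs, ge in g
    )
    union = sum(e - s for s, e in p) + sum(e - s for s, e in g) - inter
    return inter / union if union > 0 else 0.0


def reward(pred_spans: List[Span], gold_spans: List[Span],
           pred_option: str, gold_option: str,
           w_iou: float = 0.5, w_ans: float = 0.5) -> float:
    """Weighted sum of localization IoU and answer correctness.

    The weights are hypothetical placeholders.
    """
    correct = 1.0 if pred_option == gold_option else 0.0
    return w_iou * temporal_iou(pred_spans, gold_spans) + w_ans * correct


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Centre and scale rewards within one group of sampled outputs."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the answer term and the IoU term enter a single scalar, a sample with a well-placed span and a correct option receives a higher advantage than one that gets only one of the two right.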

Paper Structure

This paper contains 46 sections, 6 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Overview of ViTL (Video-in-the-Loop) and VGrounding-QA. ViTL (bottom right): Given a long video $V$ and a question $q$, Stage 1 (Ground) takes a grounding query distilled from $q$ ("locate the moments needed to answer $q$") and predicts one or more relevant temporal spans $\mathcal{S}=\{[t_s^{(i)},t_e^{(i)}]\}$. Supervision comes from event-graph gold spans. Stage 2 (Answer) re-encodes only the frames within $\mathcal{S}$ at higher fidelity (e.g., higher frame rate or resolution) and answers the original MCQA. Training follows an R1-style loop that jointly optimizes grounding (IoU-based) and QA (cross-entropy or reward) objectives, encouraging spans that improve answering. VGrounding-QA (top right): The span-aware training set is derived from an event knowledge graph.
  • Figure 2: Training-set construction from event graphs via semantic chunking. A long video is first buffered into short uniform chunks (e.g., 3 s), and a description is produced for each chunk. Neighboring chunks with high textual similarity are merged into semantic segments; their absolute start/end times become the ground-truth span(s) (example: $30{:}58 \rightarrow 32{:}19$). Each segment is summarized into an event description and converted into a span-grounded MCQA instance whose question is answerable using only this span; distractors are mined from other events in the same video.
  • Figure 3: Qualitative demonstration of our two-stage reasoning and grounding pipeline on a sample video from Charades-STA.
  • Figure 4: Qualitative demonstration of our two-stage reasoning and grounding pipeline on a sample video from CG-Bench.
  • Figure 5: Qualitative demonstration of our two-stage reasoning and grounding pipeline on a sample video from CG-Bench.
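
The semantic chunking step described in the Figure 2 caption can be illustrated with a short sketch that merges neighboring chunks whose descriptions are similar and reports each segment's start and end time. The similarity measure here (Jaccard overlap of lowercase word sets), the 0.5 threshold, and the function names are stand-in assumptions for illustration; the caption does not specify how textual similarity is computed.

```python
from typing import List, Tuple

Segment = Tuple[float, float, List[str]]  # (start_sec, end_sec, descriptions)


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two descriptions (stand-in similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def merge_chunks(descriptions: List[str], chunk_sec: float = 3.0,
                 threshold: float = 0.5) -> List[Segment]:
    """Merge consecutive uniform chunks into semantic segments.

    Chunk i covers [i * chunk_sec, (i + 1) * chunk_sec). A chunk joins the
    current segment when its description is similar enough to the previous
    chunk's description; otherwise it starts a new segment.
    """
    segments: List[Segment] = []
    for i, desc in enumerate(descriptions):
        start, end = i * chunk_sec, (i + 1) * chunk_sec
        if segments and jaccard(segments[-1][2][-1], desc) >= threshold:
            seg_start, _, seg_descs = segments[-1]
            segments[-1] = (seg_start, end, seg_descs + [desc])
        else:
            segments.append((start, end, [desc]))
    return segments
```

Each returned segment's start and end time would then play the role of a ground-truth span for the question generated from that segment.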