VITED: Video Temporal Evidence Distillation

Yujie Lu, Yale Song, William Wang, Lorenzo Torresani, Tushar Nagarajan

TL;DR

VITED tackles complex, temporally grounded reasoning in long-form VideoQA by generating and distilling temporal evidence chains. It automatically constructs an evidence pool from video segments at multiple granularities, refines and searches that pool for coherent evidence chains, and trains a temporally aware VLM with a two-stage curriculum to answer questions while producing evidence chains. The approach yields state-of-the-art or competitive results on six VideoQA benchmarks, improves grounding accuracy, and offers interpretable chain-of-thought explanations. By integrating evidence generation, grounding, and reasoning into a single model, VITED advances temporally grounded understanding of video content, with practical implications for robust, explainable AI in video analysis.

Abstract

We investigate complex video question answering via chain-of-evidence reasoning -- identifying sequences of temporal spans from multiple relevant parts of the video, together with visual evidence within them. Existing models struggle with multi-step reasoning because they uniformly sample a fixed number of frames, which can miss critical evidence distributed nonuniformly throughout the video. Moreover, they lack the ability to temporally localize such evidence in the broader context of the full video, which is required for answering complex questions. We propose a framework to enhance existing VideoQA datasets with evidence reasoning chains, automatically constructed by searching for optimal intervals of interest in the video, with supporting evidence, that maximize the likelihood of answering a given question. We train our model (VITED) to generate these evidence chains directly, enabling it both to localize evidence windows and to perform multi-step reasoning across them in long-form video content. We show the value of our evidence-distilled models on a suite of long video QA benchmarks, where we outperform state-of-the-art approaches that lack evidence reasoning capabilities.
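
The idea of searching for intervals that maximize the likelihood of answering a question can be illustrated with a small sketch. This is not the paper's algorithm: the greedy search, the window sizes, and the names `multi_granularity_segments`, `greedy_evidence_chain`, and `toy_score` are all assumptions introduced for illustration, and the scoring function is a toy stand-in for what would be a model-based answer likelihood.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def multi_granularity_segments(duration: float, window_sizes: List[float]) -> List[Segment]:
    """Split [0, duration] into back-to-back windows, once per window size."""
    segments: List[Segment] = []
    for size in window_sizes:
        start = 0.0
        while start < duration:
            segments.append((start, min(start + size, duration)))
            start += size
    return segments


def greedy_evidence_chain(
    pool: List[Segment],
    score_fn: Callable[[List[Segment]], float],
    max_hops: int = 3,
) -> Tuple[List[Segment], float]:
    """Greedily add the segment that most improves score_fn; stop when nothing helps.

    Greedy selection is an assumed stand-in, not the search used in the paper.
    """
    chain: List[Segment] = []
    best = score_fn(chain)
    for _ in range(max_hops):
        candidates = [s for s in pool if s not in chain]
        if not candidates:
            break
        top = max(candidates, key=lambda s: score_fn(chain + [s]))
        top_score = score_fn(chain + [top])
        if top_score <= best:
            break  # no remaining segment improves the score
        chain.append(top)
        best = top_score
    return sorted(chain), best


def toy_score(chain: List[Segment]) -> float:
    """Toy proxy for answer likelihood: reward covering two key moments,
    with a small penalty on total window length (all values are made up)."""
    key_moments = [12.0, 45.0]
    covered = sum(any(s <= t < e for s, e in chain) for t in key_moments)
    total_len = sum(e - s for s, e in chain)
    return covered / len(key_moments) - 0.001 * total_len


pool = multi_granularity_segments(60.0, [10.0, 30.0])
chain, score = greedy_evidence_chain(pool, toy_score)
print(chain)  # [(10.0, 20.0), (40.0, 50.0)]
```

In this toy setting the search prefers two short windows over one long window covering the same moments, because the length penalty favors tight localization. A real scoring function would instead query a model for the probability of the correct answer given the selected evidence.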

Paper Structure

This paper contains 45 sections, 1 equation, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: Main idea. We produce multiple, temporally localized pieces of evidence (the "evidence chain") to support complex reasoning in VideoQA. Our ViTED model is trained to generate this evidence chain to enable temporally-grounded chain-of-thought reasoning in video.
  • Figure 2: Overview of the ViTED evidence generation framework. There are three main stages: (1) We first generate the evidence pool (detailed captions for segments at multiple granularities) and rank them based on relevance to the question (left, Sec. \ref{sec:evidence_pool}). (2) Next, we search over the evidence pool to derive evidence chains that are most predictive of the target answer, and summarize them into a coherent and logical chain-of-thought (top-right, Sec. \ref{sec:evidence_search}). (3) Finally, if the evidence chain successfully leads to the correct answer, we add it to our dataset for training our model (bottom-right, Sec. \ref{sec:evidence_distill}).
  • Figure 3: Example of temporal evidence on NExT-QA.
  • Figure 4: Analysis of evidence quality. Left: Human evaluation score on the quality of temporal evidence chain-of-thought. Right: Distribution of the number of hops in synthesized evidence chains across four datasets.
  • Figure 5: Examples of generated evidence chains. Compared to traditional chain-of-thought approaches, ViTED demonstrates temporal evidence generation and reasoning capabilities, accurately analyzing the sequence of actions in the video to reach the correct final answer. Colored text and highlights are for visualization only and correspond to wrong evidence, correct evidence, and temporal localization windows of generated evidence (blue text).
  • ...and 2 more figures