VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs

Ruotong Liao, Max Erler, Huiyu Wang, Guangyao Zhai, Gengyuan Zhang, Yunpu Ma, Volker Tresp

TL;DR

VideoINSTA contributes a zero-shot framework for long video understanding using LLMs, an event-based temporal reasoning and content-based spatial reasoning approach for LLMs to reason over spatial-temporal information in videos, and a self-reflective information reasoning scheme balancing temporal factors based on information sufficiency and prediction confidence.

Abstract

In the video-language domain, recent work leveraging zero-shot Large Language Model (LLM)-based reasoning for video understanding has become a competitive challenger to previous end-to-end models. However, long video understanding presents unique challenges due to the complexity of reasoning over extended timespans, even for zero-shot LLM-based approaches. The challenge of information redundancy in long videos prompts the question of what specific information is essential for LLMs and how to leverage them for complex spatial-temporal reasoning in long-form video analysis. We propose VideoINSTA, a framework for INformative Spatial-TemporAl Reasoning in zero-shot long-form video understanding. VideoINSTA contributes (1) a zero-shot framework for long video understanding using LLMs; (2) an event-based temporal reasoning and content-based spatial reasoning approach for LLMs to reason over spatial-temporal information in videos; (3) a self-reflective information reasoning scheme balancing temporal factors based on information sufficiency and prediction confidence. Our model significantly improves the state of the art on three long video question-answering benchmarks (EgoSchema, NextQA, and IntentQA) and on the open question-answering dataset ActivityNetQA. The code is released at https://github.com/mayhugotong/VideoINSTA.

Paper Structure

This paper contains 58 sections, 2 equations, 9 figures, 18 tables, 1 algorithm.

Figures (9)

  • Figure 1: Framework of VideoINSTA. VideoINSTA consists of three phases. (1) Event-based Temporal Reasoning. Temporal Segmentation parses the video into events via proposed C-DPCKNN clustering, and Temporal Grounding derives semantic temporal information inherited from the global relevance of each event. (2) Content-based Spatial Reasoning. Action Captions are derived for each clip by video captioners as basic spatial information. Compensated with Object Detections, the spatial information is summarized to derive query-focused spatial information. (3) Self-reflective Information Reasoning. The previously derived spatial-temporal information is merged according to their information sufficiency in descending order and the LLM performs multi-round predictions after information merging until it comes to a confident self-evaluation.
  • Figure 2: Illustration of Temporal Reasoning in VideoINSTA. In Temporal Segmentation, the proposed C-DPCKNN sets clear borders with minimum density peaks. In Temporal Grounding, each event inherits the global relevance information derived from UniVTG according to these borders. The inherited local temporal information is transformed into semantic prompts, empowering temporal reasoning in VideoINSTA.
  • Figure 3: Spatial Reasoning in VideoINSTA.
  • Figure 4: Ablation on different temporal segmentation methods of VideoINSTA.
  • Figure 5: Ablation Studies on EgoSchema. (a) All three phases contribute to VideoINSTA. (b) $K=4$ is the best empirical clustering number for EgoSchema.
  • ...and 4 more figures
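
The self-reflective information reasoning phase described in the Figure 1 caption can be illustrated with a minimal sketch. This is one possible reading of the caption, not the authors' implementation: the `EventInfo` container, the `sufficiency` score, the `predict` callable, and the `confidence_threshold` value are all hypothetical names and assumptions introduced here. The sketch adds event information in descending order of sufficiency and re-predicts after each merge until the predictor reports enough confidence.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EventInfo:
    """Spatial-temporal information for one event (hypothetical container)."""
    description: str    # e.g. summarized captions and object detections
    sufficiency: float  # assumed score: higher means more informative


# A predictor maps (question, merged context) to (answer, confidence).
# In practice this would wrap an LLM call; here it is only a type alias.
Predictor = Callable[[str, str], Tuple[str, float]]


def self_reflective_answer(
    question: str,
    events: List[EventInfo],
    predict: Predictor,
    confidence_threshold: float = 0.8,  # assumed stopping criterion
) -> Tuple[str, int]:
    """Merge event information round by round until the predictor is confident.

    Returns the final answer and the number of prediction rounds used.
    """
    if not events:
        raise ValueError("at least one event is required")

    # Most informative events first (descending sufficiency).
    ordered = sorted(events, key=lambda e: e.sufficiency, reverse=True)

    merged: List[str] = []
    answer = ""
    for round_idx, event in enumerate(ordered, start=1):
        merged.append(event.description)
        answer, confidence = predict(question, "\n".join(merged))
        if confidence >= confidence_threshold:
            return answer, round_idx

    # No round reached the threshold: fall back to the last prediction.
    return answer, len(ordered)


def toy_predict(question: str, context: str) -> Tuple[str, float]:
    """Stand-in for an LLM: confident only once 'door' appears in the context."""
    if "door" in context:
        return "opens the door", 0.9
    return "unknown", 0.2


events = [
    EventInfo("person walks down a hallway", 0.9),
    EventInfo("person opens the door", 0.5),
    EventInfo("person sits on a chair", 0.3),
]
answer, rounds = self_reflective_answer("What does the person do?", events, toy_predict)
```

With the toy predictor, the first round sees only the hallway description and is not confident; the second round merges in the door event and stops. Whether the real system merges one event per round or uses a different merging schedule is not stated in the caption, so that choice is an assumption of this sketch.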