Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought

Vaishnavi Himakunthala; Andy Ouyang; Daniel Rose; Ryan He; Alex Mei; Yujie Lu; Chinmay Sonar; Michael Saxon; William Yang Wang

TL;DR

VIP introduces an inference-time dataset for evaluating video reasoning through video chain-of-thought over keyframes. It provides two textual representations per keyframe—unstructured dense captions and structured FAMOuS descriptions—and defines Video Infilling and Video Prediction tasks to probe multi-hop, multi-frame reasoning using language models. The dataset construction combines automated keyframe extraction with grounding and crowdsourced quality checks, enabling scalable evaluation on real-world videos. Experiments with GPT-4, GPT-3, and Vicuna reveal that current models show potential but face substantial gaps in robust video reasoning, motivating future work on more integrated video-language reasoning and generation capabilities.

Abstract

Despite exciting recent results showing vision-language systems' capacity to reason about images using natural language, their capacity for video reasoning remains under-explored. We motivate framing video reasoning as the sequential understanding of a small number of keyframes, thereby leveraging the power and robustness of vision-language models while alleviating the computational complexities of processing videos. To evaluate this novel application, we introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought. Inspired by visually descriptive scene plays, we propose two formats for keyframe description: unstructured dense captions and structured scene descriptions that identify the focus, action, mood, objects, and setting (FAMOuS) of the keyframe. To evaluate video reasoning, we propose two tasks: Video Infilling and Video Prediction, which test abilities to generate multiple intermediate keyframes and predict future keyframes, respectively. We benchmark GPT-4, GPT-3, and Vicuna on VIP, demonstrate the performance gap in these complex video reasoning tasks, and encourage future work to prioritize language models for efficient and generalized video reasoning.
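To make the two tasks concrete, the following is a minimal sketch of how a FAMOuS description and the infilling and prediction splits might be represented. It is not the authors' code: the class name `FamousDescription`, the helper names `split_for_prediction` and `split_for_infilling`, and the choice of contiguous context windows are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class FamousDescription:
    """Structured description of one keyframe (hypothetical field layout)."""
    focus: str
    action: str
    mood: str
    objects: Tuple[str, ...]
    setting: str


Frames = Sequence[FamousDescription]


def split_for_prediction(
    frames: Frames, n_context: int, n_predict: int
) -> Tuple[List[FamousDescription], List[FamousDescription]]:
    """Return (context, targets): the first n_context keyframes are given,
    and the next n_predict keyframes are what a model must predict."""
    if n_context < 1 or n_predict < 1 or n_context + n_predict > len(frames):
        raise ValueError("not enough keyframes for the requested split")
    context = list(frames[:n_context])
    targets = list(frames[n_context:n_context + n_predict])
    return context, targets


def split_for_infilling(
    frames: Frames, n_before: int, n_masked: int, n_after: int
) -> Tuple[List[FamousDescription], List[FamousDescription], List[FamousDescription]]:
    """Return (before, masked, after): the masked middle keyframes are what a
    model must generate, given the keyframes on either side as context."""
    total = n_before + n_masked + n_after
    if min(n_before, n_masked, n_after) < 1 or total > len(frames):
        raise ValueError("not enough keyframes for the requested split")
    before = list(frames[:n_before])
    masked = list(frames[n_before:n_before + n_masked])
    after = list(frames[n_before + n_masked:total])
    return before, masked, after
```

Under these assumptions, a five-keyframe clip with two context frames and three target frames corresponds to `split_for_prediction(frames, 2, 3)`, while `split_for_infilling(frames, 1, 3, 1)` hides the three middle keyframes between one preceding and one following frame.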

Paper Structure (36 sections, 9 figures, 6 tables, 1 algorithm)

Introduction
Related Work
AI Reasoning.
Datasets for Video Understanding.
Textual Representations of Videos.
VIP Dataset Construction
Representative Keyframe Selection
Selecting Video Frames.
Pruning Redundant Frames.
Textual Representations of Keyframes
Unstructured, Dense Captions.
FAMOuS Structured Scene Descriptions.
Dataset Contributions
Video Reasoning Tasks
Video Infilling Task
...and 21 more sections

Figures (9)

Figure 1: The Video Infilling and Prediction Dataset provides two ways to describe keyframes: an unstructured dense caption and a structured scene description with five components: Focus, Action, Mood, Objects, and Setting (FAMOuS). The unstructured dense captions are highly detailed and can support visually descriptive reasoning tasks, while the structured scene descriptions provide a concise visual description of the keyframe that can aid more focused reasoning tasks.
Figure 2: Distribution of VIP's real-world video domains, weighted to emphasize videos containing significant visual change.
Figure 3: Overview of the pipeline to generate the scene descriptions provided in the VIP Dataset. We first process a video and extract the important frames (\ref{subsec:keyframe-selection}), then generate scene descriptions by extracting visual information from each keyframe, along with grounding information from the video to offset model hallucinations. We then feed the extracted information into GPT-4 to generate the dense captions and structured scene descriptions (\ref{subsec:generating-frame-descriptions}).
Figure 4: frame_extract($v$, $c$, $f$)
Figure 5: Given a number of context frames, the frame prediction task requires models to predict the following $n$ frames. In this example, we provide two FAMOuS scene descriptions and use Vicuna and GPT-4 to predict the next three frames. Results emphasized in red differ from the ground truth.
...and 4 more figures
