Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought
Vaishnavi Himakunthala, Andy Ouyang, Daniel Rose, Ryan He, Alex Mei, Yujie Lu, Chinmay Sonar, Michael Saxon, William Yang Wang
TL;DR
VIP introduces an inference-time dataset for evaluating video reasoning through video chain-of-thought over keyframes. It provides two textual representations per keyframe (unstructured dense captions and structured FAMOuS descriptions) and defines Video Infilling and Video Prediction tasks to probe multi-hop, multi-frame reasoning using language models. The dataset construction combines automated keyframe extraction with grounding and crowdsourced quality checks, enabling scalable evaluation on real-world videos. Experiments with GPT-4, GPT-3, and Vicuna reveal that current models show potential but face substantial gaps in robust video reasoning, motivating future work on more integrated video-language reasoning and generation capabilities.
Abstract
Despite exciting recent results showing vision-language systems' capacity to reason about images using natural language, their capacity for video reasoning remains under-explored. We motivate framing video reasoning as the sequential understanding of a small number of keyframes, thereby leveraging the power and robustness of vision-language models while alleviating the computational complexities of processing videos. To evaluate this novel application, we introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought. Inspired by visually descriptive scene plays, we propose two formats for keyframe description: unstructured dense captions and structured scene descriptions that identify the focus, action, mood, objects, and setting (FAMOuS) of the keyframe. To evaluate video reasoning, we propose two tasks: Video Infilling and Video Prediction, which test abilities to generate multiple intermediate keyframes and predict future keyframes, respectively. We benchmark GPT-4, GPT-3, and Vicuna on VIP, demonstrate the performance gap in these complex video reasoning tasks, and encourage future work to prioritize language models for efficient and generalized video reasoning.
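As a rough illustration of the setup described above, the following sketch shows one way a FAMOuS keyframe description and the two task splits could be represented. It is not the VIP release format: the class name `FamousDescription`, the functions `infilling_split` and `prediction_split`, and the parameters `start`, `n_masked`, and `n_context` are all hypothetical names chosen for this example. The only details taken from the abstract are the five FAMOuS fields and the idea that infilling targets intermediate keyframes while prediction targets future ones.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


@dataclass
class FamousDescription:
    """Hypothetical container for one structured keyframe description.

    The five fields mirror the FAMOuS acronym; the types are assumptions.
    """
    focus: str
    action: str
    mood: str
    objects: List[str]
    setting: str


def infilling_split(
    frames: Sequence[T], start: int, n_masked: int
) -> Tuple[List[T], List[T], List[T]]:
    """Split keyframes into (preceding context, masked targets, following context).

    The masked span must leave at least one keyframe on each side,
    so the targets are genuinely intermediate.
    """
    if n_masked < 1 or start < 1 or start + n_masked >= len(frames):
        raise ValueError("masked span needs context on both sides")
    end = start + n_masked
    return list(frames[:start]), list(frames[start:end]), list(frames[end:])


def prediction_split(
    frames: Sequence[T], n_context: int
) -> Tuple[List[T], List[T]]:
    """Split keyframes into (observed context, future targets)."""
    if not 0 < n_context < len(frames):
        raise ValueError("need at least one context and one target keyframe")
    return list(frames[:n_context]), list(frames[n_context:])
```

In this sketch, a model under evaluation would receive the context keyframe descriptions (as dense captions or as `FamousDescription` objects rendered to text) and be asked to generate the target descriptions. How the paper actually formats prompts and scores outputs is not specified in this section.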
