Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos
Jianrui Zhang, Mu Cai, Yong Jae Lee
TL;DR
The paper argues that, despite perceived advances, modern LMMs remain weak at dense temporal reasoning even on short videos. It introduces Vinoground, a temporal counterfactual benchmark of 1000 short, natural video-caption pairs in which counterfactual captions contain the same words but describe the events in a different order. The videos are curated from VATEX, with GPT-4o-assisted caption construction and human verification. The study evaluates CLIP-based and generative LMMs on text, video, and group scores and compares them against a human baseline collected on Prolific. Most models perform near random chance, and even the best model, GPT-4o, reaches only about 50% on the text and video scores, against roughly 90% for humans. Performance improves only modestly as the number of input frames rises, and difficulty varies by category (viewpoint and contextual changes are easier than object and action changes). Vinoground thus provides a rigorous, naturalistic checkpoint for developing temporally aware multimodal models, and it is intended to steer future research toward robust short-video temporal understanding.
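To make the "same words, different order" constraint concrete, the following is a minimal sketch of a check that two captions form a counterfactual pair. The function names and the example captions are hypothetical illustrations, not taken from the benchmark, and the tokenization (lowercasing, stripping punctuation) is an assumption rather than the authors' procedure.

```python
import re
from collections import Counter


def caption_words(caption: str) -> list[str]:
    """Lowercase a caption and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", caption.lower())


def is_counterfactual_pair(caption_a: str, caption_b: str) -> bool:
    """Return True if both captions use exactly the same words
    (with the same counts) but arrange them in a different order."""
    words_a = caption_words(caption_a)
    words_b = caption_words(caption_b)
    same_word_content = Counter(words_a) == Counter(words_b)
    different_order = words_a != words_b
    return same_word_content and different_order


# Hypothetical captions, written only for illustration.
first = "The man eats before he drinks."
second = "The man drinks before he eats."
print(is_counterfactual_pair(first, second))
```

A check like this only covers the lexical side of the constraint; whether the reordered caption describes a genuinely different event order still needs the human verification the summary mentions.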
Abstract
There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short and natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and object transformations. For example, the best model GPT-4o only obtains ~50% on our text and video scores, showing a large gap compared to the human baseline of ~90%. All open-source multimodal models and CLIP-based models perform much worse, producing mostly random chance performance. Through this work, we shed light onto the fact that temporal reasoning in short videos is a problem yet to be fully solved. The dataset and evaluation code are available at https://vinoground.github.io.
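The text and video scores mentioned above are not defined in this summary. The sketch below assumes Winoground-style definitions: for each counterfactual pair (two videos, two captions), the text score requires choosing the right caption for both videos, the video score requires choosing the right video for both captions, and the group score requires all four choices to be right. The `PairResult` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Model outcomes for one counterfactual pair (assumed layout)."""
    caption_for_video_0: bool  # picked the right caption when shown video 0
    caption_for_video_1: bool  # picked the right caption when shown video 1
    video_for_caption_0: bool  # picked the right video when given caption 0
    video_for_caption_1: bool  # picked the right video when given caption 1


def text_correct(r: PairResult) -> bool:
    # Both caption choices must be right.
    return r.caption_for_video_0 and r.caption_for_video_1


def video_correct(r: PairResult) -> bool:
    # Both video choices must be right.
    return r.video_for_caption_0 and r.video_for_caption_1


def group_correct(r: PairResult) -> bool:
    # All four choices must be right.
    return text_correct(r) and video_correct(r)


def benchmark_scores(results: list[PairResult]) -> dict[str, float]:
    """Return text, video, and group scores as percentages."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    return {
        "text": 100.0 * sum(text_correct(r) for r in results) / n,
        "video": 100.0 * sum(video_correct(r) for r in results) / n,
        "group": 100.0 * sum(group_correct(r) for r in results) / n,
    }


# Toy outcomes, invented for illustration only.
toy_results = [
    PairResult(True, True, True, True),      # everything right
    PairResult(True, True, True, False),     # text right, video wrong
    PairResult(True, False, True, True),     # video right, text wrong
    PairResult(False, False, False, False),  # everything wrong
]
print(benchmark_scores(toy_results))
```

Under these assumed definitions, a model that guesses each two-way question independently gets both questions of a pair right 25% of the time, so chance level for the text and video scores is 25% rather than 50%.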
