Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

Jianrui Zhang, Mu Cai, Yong Jae Lee

TL;DR

The paper argues that, despite perceived advances, modern LMMs remain limited at dense temporal reasoning even on short videos. It introduces Vinoground, a temporal counterfactual benchmark of 1000 short video-caption pairs in which the paired captions contain the same words but describe the events in a different order. The videos are curated from VATEX, with GPT-4o-assisted caption construction and human verification. The study evaluates CLIP-based and generative LMMs using text, video, and group scores and compares them against human performance collected via Prolific. Most models perform near random chance, and humans substantially outperform them (roughly 90% for humans vs. 50-60% for text and 30-40% for video among models). Performance improves only modestly as frame counts rise, and it differs clearly by category (viewpoint/contextual easier than object/action), so temporal reasoning in short videos is not yet solved. Vinoground thus provides a rigorous, naturalistic checkpoint for developing temporally aware multimodal models and can guide future research toward robust short-video temporal understanding.

Abstract

There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short and natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and object transformations. For example, the best model GPT-4o only obtains ~50% on our text and video scores, showing a large gap compared to the human baseline of ~90%. All open-source multimodal models and CLIP-based models perform much worse, producing mostly random chance performance. Through this work, we shed light onto the fact that temporal reasoning in short videos is a problem yet to be fully solved. The dataset and evaluation code are available at https://vinoground.github.io.

Paper Structure

This paper contains 25 sections, 7 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: GPT-4o answering a video-score question incorrectly. When asked which video matches the caption, which involves identifying the order of the two events mentioned, GPT-4o does not mention anything about the temporal order of events. The erroneous analyses are marked in red. It should also be noted that the analyses for both videos are completely wrong.
  • Figure 2: Example positive/negative video-caption pairs in Vinoground, for each category.
  • Figure 3: Group score for each model, grouped by category. Performance is higher on the contextual and viewpoint categories and lower on the other categories.
  • Figure 4: The data curation process.
  • Figure 5: Visualization of the text and video score metrics.
  • ...and 3 more figures
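
The text, video, and group scores referenced above (and visualized in Figure 5) can be illustrated with a minimal sketch. It assumes Winoground-style definitions, which this page does not spell out: each counterfactual pair yields two text questions (given one video, pick the matching caption) and two video questions (given one caption, pick the matching video); a pair counts toward the text score only if both text questions are right, toward the video score only if both video questions are right, and toward the group score only if all four are right. The `PairResult` layout and the function name are hypothetical and are not taken from the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class PairResult:
    """Correctness of the four questions for one counterfactual pair.

    Hypothetical layout, assumed for illustration only.
    """
    text_given_video_a: bool   # shown video A, model picked the right caption
    text_given_video_b: bool   # shown video B, model picked the right caption
    video_given_caption_a: bool  # shown caption A, model picked the right video
    video_given_caption_b: bool  # shown caption B, model picked the right video


def pairwise_scores(results: Iterable[PairResult]) -> Dict[str, float]:
    """Compute text, video, and group scores as fractions in [0, 1].

    Assumes Winoground-style scoring: a pair is credited only when every
    question of the relevant kind is answered correctly.
    """
    results = list(results)
    if not results:
        raise ValueError("need at least one pair result")

    n = len(results)
    text_hits = sum(
        r.text_given_video_a and r.text_given_video_b for r in results
    )
    video_hits = sum(
        r.video_given_caption_a and r.video_given_caption_b for r in results
    )
    group_hits = sum(
        r.text_given_video_a
        and r.text_given_video_b
        and r.video_given_caption_a
        and r.video_given_caption_b
        for r in results
    )
    return {
        "text": text_hits / n,
        "video": video_hits / n,
        "group": group_hits / n,
    }


if __name__ == "__main__":
    demo = [
        PairResult(True, True, True, True),      # everything right
        PairResult(True, True, True, False),     # text right, video wrong
        PairResult(False, True, True, True),     # text wrong, video right
        PairResult(False, False, False, False),  # everything wrong
    ]
    print(pairwise_scores(demo))
```

Under these assumed definitions, answering one question of a pair correctly earns no credit on its own, which is why a model that guesses each binary question independently would land well below 50% on the text and video scores.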