VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment

Darshana Saravanan, Varun Gupta, Darshan Singh, Zeeshan Khan, Vineet Gandhi, Makarand Tapaswi

TL;DR

This work introduces VELOCITI, a benchmark to study Video-LLMs by disentangling and assessing the comprehension of agents, actions, and their associations across multiple events, and proposes StrictVLE, which requires correct classification (rather than ranking) of both the positive and negative captions.

Abstract

A fundamental aspect of compositional reasoning in a video is associating people and their actions across time. Recent years have seen great progress in general-purpose vision or video models and a move towards long-video understanding. While this progress is exciting, we take a step back and ask: are current models good at compositional reasoning on short videos? To this end, we introduce VELOCITI, a benchmark to study Video-LLMs by disentangling and assessing the comprehension of agents, actions, and their associations across multiple events. We adopt the Video-Language Entailment setup and propose StrictVLE, which requires correct classification (rather than ranking) of both the positive and negative captions. We evaluate several models and observe that even the best, LLaVA-OneVision (44.5%) and Gemini-1.5-Pro (49.3%), are far from human accuracy at 93.0%. Results show that action understanding lags behind agent understanding, and that models perform worse on negative captions created using entities appearing in the video than on those obtained from pure text manipulation. We also present challenges with ClassicVLE and multiple-choice (MC) evaluation, strengthening our preference for StrictVLE. Finally, we validate that our benchmark requires visual inputs of multiple frames, making it ideal to study video-language compositional reasoning.

Paper Structure (59 sections, 1 equation, 18 figures, 11 tables)

Figures (18)

  • Figure 1: A Venn diagram grouping VELOCITI's seven tests (in black) that evaluate a Video-LLM across different facets: Agent Understanding, Action Understanding, and Multi-event Understanding. The benchmark is formulated as video-language entailment, where negative captions are created by manipulating text (Text-inspired Negation) or from other parts of the same video (In-Video Negation). Best seen in color.
  • Figure 2: VELOCITI evaluates Video-LLMs' video-language entailment capabilities on complex movie clips with dense semantic role label (SRL) annotations from the VidSitu dataset [vidsitu]. Positive and negative captions are shown side-by-side for each test with the key difference highlighted in green/red. Negative captions are created by (i) manipulating text using an LLM (Text-Inspired Negation) or (ii) replacing agents or actions by others that appear in the same video (In-Video Negation). We also demonstrate how the same positive caption can be used to create negative captions differently (see Agent Random vs. Agent Binding test; or Action Adversarial vs. Action Binding test). Each test evaluates models for different facets of compositional reasoning as described in \ref{subsec:tests}. The 10s video clip used in this example can be viewed here: https://www.youtube.com/embed/bt6-F11LZsQ?start=25&end=35.
  • Figure 3: Scatter plot of entailment scores $e(V, C^+)$ (x-axis) and $e(V, C^-)$ (y-axis) for all tests in the VELOCITI subset. We visualize the scores for Video-LLaVA (top) and OV-72B (bottom). ClassicVLE calls a sample correct in the region below the diagonal (light green). Instead, StrictVLE requires the dots to lie in the yellow bottom-right quadrant (dark green). Finally, samples whose points are above the diagonal are wrong for both VLE metrics (red). While recent models have improved, older models concentrate near the diagonal and in the top-right 'Yes' quadrant for both captions. The legend includes the actual number of points (please zoom in). This figure is best seen in color.
  • Figure 4: Scatter plot of entailment scores $e(V, C^+)$ (x-axis) and $e(V, C^-)$ (y-axis) for all tests in VELOCITI. We visualize the scores for several models indicated in the left margin. From top to bottom: P-LLaVA, OwlCon, Video-LLaVA, OV-7B-SI, OV-7B, QVL-7B, and OV-72B. ClassicVLE calls a sample correct in the region below the diagonal (light green). Instead, StrictVLE requires the dots to lie in the yellow bottom-right quadrant (dark green). Finally, samples whose points are above the diagonal are wrong for both VLE metrics (red). The legend includes the actual number of points (please zoom in). This figure is best seen in color.
  • Figure 5: CoT evaluation prompt.
  • ...and 13 more figures
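
The two criteria described in the captions of Figures 3 and 4 can be illustrated with a small sketch. It assumes entailment scores $e(V, C^+)$ and $e(V, C^-)$ lie in [0, 1] and that the quadrant boundary sits at 0.5; the threshold value, the function names, and the sample scores are hypothetical and are not taken from the paper.

```python
from typing import Callable, Iterable, Tuple


def classic_vle_correct(e_pos: float, e_neg: float) -> bool:
    """Ranking criterion: the positive caption only needs to outscore
    the negative one (point lies below the diagonal)."""
    return e_pos > e_neg


def strict_vle_correct(e_pos: float, e_neg: float, threshold: float = 0.5) -> bool:
    """Classification criterion: the positive caption must be accepted AND
    the negative caption rejected (point lies in the bottom-right quadrant).
    The 0.5 threshold is an assumption made for this sketch."""
    return e_pos > threshold and e_neg < threshold


def accuracy(
    pairs: Iterable[Tuple[float, float]],
    criterion: Callable[[float, float], bool],
) -> float:
    """Fraction of (e_pos, e_neg) pairs judged correct by the criterion."""
    pairs = list(pairs)
    return sum(criterion(p, n) for p, n in pairs) / len(pairs)


# Made-up (e_pos, e_neg) scores, for illustration only.
scores = [
    (0.9, 0.2),  # correct under both criteria
    (0.8, 0.7),  # 'Yes' to both captions: Classic correct, Strict wrong
    (0.4, 0.6),  # above the diagonal: wrong under both
    (0.3, 0.1),  # 'No' to both captions: Classic correct, Strict wrong
]

classic_acc = accuracy(scores, classic_vle_correct)
strict_acc = accuracy(scores, strict_vle_correct)
print(f"ClassicVLE accuracy: {classic_acc:.2f}")
print(f"StrictVLE accuracy:  {strict_acc:.2f}")
```

Under these assumptions, every sample that passes the strict criterion also passes the classic one, so a model that answers 'Yes' (or 'No') to both captions can still score well on ranking while failing classification.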