What Happens When: Learning Temporal Orders of Events in Videos

Daechul Ahn, Yura Choi, Hyeonbeom Choi, Seongwon Cho, San Kim, Jonghyun Choi

TL;DR

Existing VLMM benchmarks may overstate temporal understanding because models can answer from priors, so the authors introduce VECTOR to explicitly test event-order reasoning with synthetic multi-event videos. They show that current models rely heavily on plausible priors rather than genuine temporal cues, and propose MECoT, which combines event-level instruction fine-tuning with inference-time chain-of-thought prompting. VECTOR diagnoses both event-level and pattern-level temporal understanding, and MECoT yields consistent gains on VECTOR and on existing benchmarks. The work suggests that explicit temporal reasoning mechanisms are essential for reliable video understanding in VLMMs, and it provides datasets, prompts, and training strategies to advance this capability.

Abstract

Video Large Multimodal Models (VLMMs) have shown impressive performance in video understanding, yet their ability to accurately capture the temporal order of multiple events remains underexplored. Through comprehensive experiments, we observe that models still perform very well on existing benchmarks even when the video frames are scrambled. This implies that VLMMs may not rely on accurate sequential processing of visual events, but instead depend on prior knowledge of typical scenarios to answer the question. To benchmark temporal understanding in VLMMs, we propose VECTOR, designed to explicitly assess a model's ability to identify the temporal order of events. On this benchmark, we observe that various VLMMs often fail to understand the order of events. To address this, we propose MECoT (Multi-Event instruction fine-tuning with Chain-of-Thought), which (1) trains models on detailed, event-by-event video descriptions and (2) uses chain-of-thought prompts at inference to enhance temporal awareness. MECoT outperforms prior art on VECTOR and also improves performance on existing video benchmarks, indicating that it is effective at improving temporal understanding. We release our code, model, and datasets.
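
The frame-scrambling probe described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' evaluation code: the model is any callable taking frames and a question, frames are stood in for by integer indices, and the names `shuffle_frames`, `accuracy`, and `normalized_accuracy` are hypothetical. The ratio follows the definition given in the Figure 1 caption (accuracy on frame-shuffled videos divided by accuracy on the original videos).

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

# Assumption: a sample is (frames, question, answer); frames are plain ints here.
Sample = Tuple[List[int], str, int]
Model = Callable[[List[int], str], int]


def shuffle_frames(frames: Sequence[int], seed: int = 0) -> List[int]:
    """Return a copy of the frames in a random order (deterministic per seed)."""
    rng = random.Random(seed)
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled


def accuracy(model: Model, samples: Sequence[Sample],
             transform: Optional[Callable[[List[int]], List[int]]] = None) -> float:
    """Fraction of samples answered correctly, optionally after transforming frames."""
    if not samples:
        raise ValueError("need at least one sample")
    correct = 0
    for frames, question, answer in samples:
        inputs = transform(frames) if transform else list(frames)
        if model(inputs, question) == answer:
            correct += 1
    return correct / len(samples)


def normalized_accuracy(model: Model, samples: Sequence[Sample], seed: int = 0) -> float:
    """Accuracy on frame-shuffled inputs divided by accuracy on original inputs."""
    original = accuracy(model, samples)
    if original == 0:
        raise ValueError("accuracy on original videos is zero; ratio undefined")
    shuffled = accuracy(model, samples, lambda f: shuffle_frames(f, seed))
    return shuffled / original


# Toy models (hypothetical): one ignores frame order, one depends on it.
def order_blind_model(frames: List[int], question: str) -> int:
    return max(frames)   # same answer for any permutation of the frames


def order_aware_model(frames: List[int], question: str) -> int:
    return frames[0]     # answer changes if the first frame changes


blind_samples = [(list(range(n)), "largest index?", n - 1) for n in (4, 6, 8)]
aware_samples = [(list(range(n)), "first frame?", 0) for n in (4, 6, 8)]

blind_ratio = normalized_accuracy(order_blind_model, blind_samples)
aware_ratio = normalized_accuracy(order_aware_model, aware_samples)
```

A ratio close to 1 means shuffling barely hurts the model, which is the symptom the abstract reports on existing benchmarks. The order-blind toy model scores exactly 1.0 by construction, while the order-aware one can only stay the same or drop.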

Paper Structure

This paper contains 52 sections, 7 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Limitations of existing video benchmarks and our VECTOR benchmark. (a) Normalized accuracy (Acc.) is the ratio of a model's accuracy on frame-shuffled videos to its accuracy on the original videos. SoTA open-source VLMMs [li2024llavaov] (7B and 72B) achieve high accuracy on existing benchmarks but not on our proposed VECTOR. (b) We think that existing benchmarks often allow models to succeed via prior knowledge of typical event scenarios, circumventing true understanding of the temporal order presented in the video. (c) In contrast, VECTOR explicitly evaluates event-order understanding independent of prior knowledge.
  • Figure 2: Biased prediction on the event-ordering task. The model correctly predicts the event order ['A', 'B', 'C'] for the original video (top). However, when events A and C are swapped (bottom), it maintains the same prediction, revealing its reliance on a plausible scenario over visual temporal information.
  • Figure 3: Overview of the VECTOR benchmark, which has two evaluation groups. Left: Event-level tasks assess event sequencing, relative sequencing, and position identification. Right: Pattern-level tasks group events into semantic categories or recurring patterns, requiring models to detect the position of an anomalous event within a sequence.
  • Figure 4: Overview of MECoT. MECoT includes two stages: (a) Instruction fine-tuning on detailed multi-event descriptions to enhance event-level understanding. (b) Chain-of-thought (CoT) inference, where models generate structured narratives for videos, enabling explicit temporal reasoning before answering questions.
  • Figure 5: EM on event sequencing (Task 1) with varying event counts and input frames. We evaluate two proprietary VLMMs across different event and frame counts (indicated by legend). Even SoTA proprietary VLMMs exhibit substantial performance degradation as the number of events increases, while human performance remains stable. Notably, increasing input frames does not consistently enhance model performance.
  • ...and 5 more figures
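
The swap test in Figure 2 and the sequence-level scoring in Figure 5 can be illustrated with a small sketch. It assumes that "EM" denotes exact match of the whole predicted event order against the reference order, which the captions do not spell out, and the helper names (`exact_match_score`, `swap_events`, `follows_swap`) are hypothetical rather than taken from the released code.

```python
from typing import List, Sequence


def exact_match(predicted: Sequence[str], reference: Sequence[str]) -> bool:
    """True only if every event is predicted at its reference position."""
    return list(predicted) == list(reference)


def exact_match_score(predictions: Sequence[Sequence[str]],
                      references: Sequence[Sequence[str]]) -> float:
    """Fraction of videos whose full predicted order matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one video")
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)


def swap_events(order: Sequence[str], i: int, j: int) -> List[str]:
    """Return a copy of the order with the events at positions i and j exchanged."""
    swapped = list(order)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


def follows_swap(pred_original: Sequence[str], pred_swapped: Sequence[str],
                 i: int, j: int) -> bool:
    """True if the prediction on the edited video reflects the same swap."""
    return list(pred_swapped) == swap_events(pred_original, i, j)


# Figure 2 scenario: events A and C are swapped in the edited video,
# but the (biased) model repeats its original prediction.
original_order = ["A", "B", "C"]
edited_order = swap_events(original_order, 0, 2)   # ["C", "B", "A"]

pred_on_original = ["A", "B", "C"]
pred_on_edited = ["A", "B", "C"]

score = exact_match_score([pred_on_original, pred_on_edited],
                          [original_order, edited_order])
tracks_video = follows_swap(pred_on_original, pred_on_edited, 0, 2)
```

Here the model is right on the original video and wrong on the edited one, so `score` is 0.5 and `tracks_video` is `False`. Because exact match requires the entire sequence to be correct, it becomes stricter as the number of events grows, which is one reason to report it per event count as in Figure 5.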