Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No!

Mohamed Fazli Imam, Chenyang Lyu, Alham Fikri Aji

TL;DR

Current multimodal LLMs struggle with visual temporal understanding, a critical capability for dynamic real-world reasoning. The authors introduce TemporalVQA, a two-task benchmark evaluating Temporal Order Understanding and Time-lapse Estimation using carefully annotated image pairs and varied prompts/layouts. Evaluations across GPT-4o, Gemini-1.5-Pro, and open-source MLLMs show substantial gaps to human performance, with order-judgment accuracy near random and time-lapse accuracy around 70% at best, highlighting a fundamental limitation in current models' temporal reasoning. The work provides a public dataset and analysis framework to spur advances in temporal capabilities for multimodal models.

Abstract

Multimodal Large Language Models (MLLMs) have achieved significant advancements in tasks like Visual Question Answering (VQA) by leveraging foundational Large Language Models (LLMs). However, their abilities in specific areas such as visual temporal understanding, which is crucial for comprehending real-world dynamics, remain underexplored. To address this, we propose a challenging evaluation benchmark named TemporalVQA, consisting of two parts: 1) Temporal Order Understanding and 2) Time-lapse Estimation. The first part requires MLLMs to determine the sequence of events by analyzing temporally consecutive video frames. The second part presents image pairs with varying time differences, framed as multiple-choice questions, asking MLLMs to estimate the time-lapse between images with options ranging from seconds to years. Our evaluations of advanced MLLMs, including GPT-4o and Gemini-1.5-Pro, reveal significant challenges: GPT-4o achieved only 49.1% average consistent accuracy in the temporal order task and 70% in time-lapse estimation, with open-source models performing even worse. These findings underscore the limitations of current MLLMs in visual temporal understanding and reasoning, highlighting the need for further improvements in their temporal capabilities. Our dataset can be found at https://huggingface.co/datasets/fazliimam/temporal-vqa.

Paper Structure

This paper contains 26 sections, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Performance comparison across TemporalVQA tasks. The plot shows the accuracy (%) of different models on Temporal Order Understanding (orange) and Time-lapse Estimation (green). The accuracy shown for Temporal Order Understanding is the consistent accuracy averaged across all layouts.
  • Figure 2: An introductory diagram illustrating the task setup for the TemporalVQA benchmark. In Temporal Order Understanding, the model is asked to determine which of the two images depicts the event that happened first. In Time-lapse Estimation, the model estimates the time elapsed between two images by selecting from options such as seconds, minutes, hours, days, weeks/months, or years.
  • Figure 3: Qualitative cases illustrating output predictions from GPT-4o. 1st order refers to cases where the image pairs are fed to the model in their original sequence, while 2nd order refers to cases where the image pairs are fed in reverse (swapped) order. Text highlighted in green represents correct classifications, while red indicates misclassifications. Orange denotes instances of hallucination, and brown denotes instances of illogical reasoning.
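
The captions refer to an "average consistent accuracy" computed over original and swapped presentation orders and over several layouts, but this page does not spell out the metric. The sketch below shows one plausible reading, under the assumption that an image pair counts as correct only when the model answers correctly in both orders, and that the per-layout scores are then averaged with equal weight. The `PairResult` type, the function names, and the layout labels are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PairResult:
    """Outcome for one image pair under one layout (hypothetical structure)."""
    layout: str             # illustrative label, e.g. "horizontal"
    correct_original: bool  # model was right with the pair in its original order
    correct_swapped: bool   # model was right with the pair in swapped order


def consistent_accuracy(results: List[PairResult]) -> float:
    """Percentage of pairs answered correctly in BOTH presentation orders."""
    if not results:
        return 0.0
    hits = sum(r.correct_original and r.correct_swapped for r in results)
    return 100.0 * hits / len(results)


def average_consistent_accuracy(results: List[PairResult]) -> float:
    """Mean of the per-layout consistent accuracies (assumed equal weighting)."""
    by_layout: Dict[str, List[PairResult]] = {}
    for r in results:
        by_layout.setdefault(r.layout, []).append(r)
    if not by_layout:
        return 0.0
    scores = [consistent_accuracy(group) for group in by_layout.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    demo = [
        PairResult("horizontal", True, True),
        PairResult("horizontal", True, False),
        PairResult("vertical", True, True),
        PairResult("vertical", False, True),
        PairResult("vertical", True, False),
        PairResult("vertical", False, False),
    ]
    print(average_consistent_accuracy(demo))
```

Under this reading, a model that always picks the first image it is shown is right in one order and wrong in the other for every pair, so it scores 0% on consistent accuracy even though its plain accuracy would be 50%. That may be why a consistency-based score is a stricter test of temporal reasoning than per-query accuracy.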