Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No!
Mohamed Fazli Imam, Chenyang Lyu, Alham Fikri Aji
TL;DR
Current multimodal LLMs struggle with visual temporal understanding, a critical capability for dynamic real-world reasoning. The authors introduce TemporalVQA, a two-task benchmark evaluating Temporal Order Understanding and Time-lapse Estimation using carefully annotated image pairs and varied prompts/layouts. Evaluations across GPT-4o, Gemini-1.5-Pro, and open-source MLLMs show substantial gaps to human performance, with order-judgment accuracy near random and time-lapse accuracy around 70% at best, highlighting a fundamental limitation in current models' temporal reasoning. The work provides a public dataset and analysis framework to spur advances in temporal capabilities for multimodal models.
Abstract
Multimodal Large Language Models (MLLMs) have achieved significant advancements in tasks like Visual Question Answering (VQA) by leveraging foundational Large Language Models (LLMs). However, their abilities in specific areas such as visual temporal understanding, which is crucial for comprehending real-world dynamics, remain underexplored. To address this, we propose a challenging evaluation benchmark named TemporalVQA, consisting of two parts: 1) Temporal Order Understanding and 2) Time-lapse Estimation. The first part requires MLLMs to determine the sequence of events by analyzing temporally consecutive video frames. The second part presents image pairs with varying time differences, framed as multiple-choice questions, asking MLLMs to estimate the time-lapse between images with options ranging from seconds to years. Our evaluations of advanced MLLMs, including GPT-4o and Gemini-1.5-Pro, reveal significant challenges: GPT-4o achieved only 49.1% average consistent accuracy on the temporal order task and 70% on time-lapse estimation, with open-source models performing even worse. These findings underscore the limitations of current MLLMs in visual temporal understanding and reasoning, highlighting the need for further improvements in their temporal capabilities. Our dataset can be found at https://huggingface.co/datasets/fazliimam/temporal-vqa.
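The abstract reports "consistent accuracy" without defining it here. As a rough illustration only, the sketch below assumes one plausible reading: an image pair counts as correct only if the model answers correctly under every presentation variant of that pair (for example, original and swapped image order). This definition, the function names, and the toy data are assumptions for illustration, not the authors' implementation.

```python
from typing import Dict, List


def plain_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of individual answers that are correct, over all variants."""
    flags = [flag for variant_flags in results.values() for flag in variant_flags]
    if not flags:
        raise ValueError("no answers to score")
    return sum(flags) / len(flags)


def consistent_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of pairs answered correctly under ALL variants.

    ASSUMPTION: this "all variants correct" rule is a hypothetical
    reading of the metric, not a definition taken from the paper.
    """
    if not results:
        raise ValueError("no pairs to score")
    consistent = sum(
        1 for variant_flags in results.values()
        if variant_flags and all(variant_flags)
    )
    return consistent / len(results)


# Toy data (hypothetical): correctness per pair for [original, swapped] order.
toy_results = {
    "pair_1": [True, False],   # right only in the original order
    "pair_2": [True, False],
    "pair_3": [True, True],    # right in both orders
    "pair_4": [False, False],
}

print(plain_accuracy(toy_results))       # 4 of 8 answers correct
print(consistent_accuracy(toy_results))  # 1 of 4 pairs consistently correct
```

Under this reading, a model with a positional bias (always claiming the first image came first) can score about 50% on individual answers while scoring far lower on consistent accuracy, which is why a consistency-based metric is a stricter test of temporal order understanding.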
