Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models

Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa

TL;DR

The paper asks whether vision–language models embed human-like temporal and causal inductive biases by evaluating their ability to judge the arrow of time in videos. It introduces AoT-PsyPhyBENCH, a psychophysically validated benchmark built from controlled, low-ambiguity clips with direct human baselines to enable reliable model–human comparisons. Across zero-shot, few-shot, and chain-of-thought prompting as well as supervised fine-tuning, the study finds that most models perform near chance and lag far behind humans, and that reasoning and increased deliberation often amplify a bias toward predicting forward rather than improve temporal understanding. The findings highlight a fundamental gap in current multimodal systems: they capture strong visual-semantic correlations but lack robust inductive biases for temporal continuity and physical causality. The authors release the benchmark, code, and model outputs to spur progress toward temporally aware and physically grounded VLMs.

Abstract

Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), that is, whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.

Paper Structure

This paper contains 28 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Overview of the arrow-of-time (AoT) task. Humans identify the AoT for both forward and backward playbacks with high accuracy; however, VLMs perform substantially worse and exhibit a label-prediction bias, preferring Forward (see the section on label-prediction bias).
  • Figure 2: Left: a backward video clip (category: Put). Top-Right: Qwen2.5VL-72B Multi-step CoT reasoning. Bottom-Right: Gemini-2.5-Pro's self-explained reasoning trace. Qwen2.5VL-72B correctly identified the event in the scene and made a valid assumption, but failed to observe that the event was reversed, which led to an incorrect conclusion. In contrast, Gemini-2.5-Pro correctly detected the reversal of the event in Step 2 based on a valid assumption it made in Step 1.
  • Figure 3: Per-category comparison on AoT-PsyPhyBENCH between humans and three representative models: Cosmos-reason1-7B (zero-shot; best open-weight in this setting), GPT-4.1 (zero-shot; best proprietary and best overall model in this setting), and Gemini-2.5-Pro (zero-shot, low-reasoning effort; best model across all settings). Panels show (a) forward F1 (left), (b) backward F1 (middle), and (c) overall accuracy (right). Humans remain consistently high across all categories and both directions ($\approx$ 80-100%), whereas VLMs show substantial gaps. Backward detection is the most challenging, revealing a forward-direction bias (with Cosmos-reason1 as a notable exception, showing comparatively strong backward F1).
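
The quantities named in the figure captions (forward F1, backward F1, overall accuracy, and a preference for the Forward label) can be illustrated with a minimal sketch. This is not the authors' released evaluation code: the function names `f1_for_label` and `summarize`, the label strings "forward" and "backward", and the "forward_rate" field are all hypothetical choices made for illustration, and the toy predictions below are invented inputs, not results from the paper.

```python
from typing import Dict, List

FORWARD = "forward"    # assumed label string, not taken from the paper
BACKWARD = "backward"  # assumed label string, not taken from the paper


def f1_for_label(gold: List[str], pred: List[str], label: str) -> float:
    """F1 score treating `label` as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def summarize(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Accuracy, per-direction F1, and the share of Forward predictions."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equal in length")
    return {
        "accuracy": sum(g == p for g, p in zip(gold, pred)) / len(gold),
        "forward_f1": f1_for_label(gold, pred, FORWARD),
        "backward_f1": f1_for_label(gold, pred, BACKWARD),
        # On a balanced set, a value well above 0.5 suggests a Forward bias.
        "forward_rate": sum(p == FORWARD for p in pred) / len(pred),
    }


if __name__ == "__main__":
    # Toy balanced set; a hypothetical model that always answers Forward.
    gold = [FORWARD, FORWARD, BACKWARD, BACKWARD]
    always_forward = [FORWARD] * 4
    print(summarize(gold, always_forward))
```

On a set with equal numbers of forward and backward clips, a model that always answers Forward reaches chance accuracy with a backward F1 of zero, which is why reporting per-direction F1 alongside accuracy helps expose a label-prediction bias.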