Seeing the Arrow of Time in Large Multimodal Models

Zihui Xue, Mi Luo, Kristen Grauman

TL;DR

This work addresses a gap in how large multimodal models (LMMs) understand temporal directionality, the Arrow of Time (AoT), by introducing ArrowRL, an RL-based post-training method that uses a reverse-video signal to incentivize AoT-aware responses. ArrowRL couples a fidelity reward with a novel reverse reward within a GRPO framework and trains on a curated, temporally rich dataset. To evaluate AoT perception, the authors propose AoTBench, a three-task benchmark covering sequence direction, directional caption matching, and AoT-sensitive VQA, along with a Temporal Divergence Score (TDS) for quantifying the temporal sensitivity of benchmarks. Empirical results show that ArrowRL markedly improves AoTBench performance and transfers to standard VQA benchmarks with substantial gains, while maintaining or improving performance on non-temporal video tasks, underscoring AoT awareness as a practical route to deeper temporal understanding in LMMs.

Abstract

The Arrow of Time (AoT), time's irreversible flow shaping physical events, is fundamental to video comprehension, yet remains a significant challenge for modern large multimodal models (LMMs). Current LMMs struggle to perceive and utilize temporal directionality in video when responding to language queries, obstructing deeper temporal understanding. We tackle this deficiency by first providing a critical analysis of existing benchmarks and models. We then introduce ArrowRL, a reinforcement learning (RL)-based training strategy with an innovative reverse reward that instills AoT awareness by encouraging divergent video interpretations between forward and reversed visual frames. For rigorous evaluation, we additionally develop AoTBench, a new multi-faceted benchmark probing temporally challenging questions. Experiments show ArrowRL greatly advances temporal perception: it not only achieves substantial improvements on our challenging AoTBench but also demonstrably boosts performance on standard video question answering (VQA) benchmarks (with peak accuracy gains reaching over 20% and 10%, respectively). This validates ArrowRL's effectiveness and highlights the critical need for dedicated AoT understanding in LMMs.
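
To make the reward idea above concrete, here is a minimal sketch under stated assumptions. The summary only says that a fidelity reward is coupled with a reverse reward inside a GRPO framework; it does not give the formulas. The token-overlap similarity, the weight `alpha`, and the function names `token_overlap`, `arrow_reward`, and `group_relative_advantages` are all hypothetical stand-ins chosen for illustration, not the authors' implementation. The only part taken from standard GRPO practice is normalizing rewards within a group of sampled responses.

```python
from statistics import mean, pstdev


def token_overlap(a: str, b: str) -> float:
    """Toy Jaccard similarity over lowercase tokens (assumed stand-in metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def arrow_reward(response: str, reference: str, reverse_response: str,
                 alpha: float = 0.5) -> float:
    """Hypothetical combined reward.

    fidelity: how close the forward-video response is to the reference.
    reverse:  how much the forward-video response differs from a response
              produced for the reversed video (divergence is rewarded).
    alpha:    assumed weighting between the two terms.
    """
    fidelity = token_overlap(response, reference)
    reverse = 1.0 - token_overlap(response, reverse_response)
    return fidelity + alpha * reverse


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    reference = "the candle is lit"
    # Same forward response, two different reversed-video responses.
    diverging = arrow_reward("the candle is lit", reference, "the candle is extinguished")
    identical = arrow_reward("the candle is lit", reference, "the candle is lit")
    print(diverging > identical)
    print(group_relative_advantages([diverging, identical, 0.5]))
```

In this sketch, a response that matches the reference and also differs from the reversed-video response scores higher than one that would be given for either playback direction, which is the behavior the reverse reward is described as encouraging.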

Paper Structure

This paper contains 41 sections, 1 equation, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Arrow of Time (AoT) perception challenges, demonstrated by a strong representative base LMM (Qwen2.5-VL-7B [bai2025qwen25]). (a) Basic visual directionality (forward vs. reverse), trivial for humans, often confounds these models. (b) Deeper temporal insensitivity is also observed, where LMMs often generate the same description (e.g., "ignite") for events with opposite semantics based on temporal direction. We propose ArrowRL to instill AoT awareness for LMM temporal perception.
  • Figure 2: Missing the AoT: Multiple choice question (MCQ) accuracy of a strong representative LMM (LLaVA-OV-7B [li2024llavaov]) on standard VQA benchmarks [xiao2021nextqa, fu2024videomme, patraucean2023perception, mangalam2023egoschema, cai2024temporalbench, li2024vitatecs, li2024mvbench, wu2024longvideobench] when processing forward, shuffled, and reversed video sequences. S: short, L: long. The small or negligible performance drop across conditions highlights low temporal sensitivity, stemming from deficiencies in LMM capabilities and benchmark question design.
  • Figure 3: Temporal Sensitivity Analysis. (Left) Model Sensitivity: comparing MCQ accuracy on TVBench [cores2024tvbench] for various LMMs [li2024mvbench, liu2024st, xu2024pllava, wang2024tarsier, li2024llavaov, zhang2024llavavideo, wang2024qwen2, bai2025qwen25], on forward vs. reverse video sequences. LLaVA-OV-7B [li2024llavaov], LLaVA-Video-7B [zhang2024llavavideo], and Qwen2.5-VL-7B [bai2025qwen25] demonstrate highest accuracy and AoT sensitivity. (Right) Benchmark Sensitivity: comparing the proposed temporal divergence score (TDS) for various VQA benchmarks [li2024vitatecs, cai2024temporalbench, xiao2021nextqa, patraucean2023perception, fu2024videomme, zhang2024vinoground, liu2024tempcompass, cores2024tvbench], along with relative accuracy drop (mean $\pm$ std.) calculated using forward vs. reverse videos. Benchmarks with higher scores (Vinoground [zhang2024vinoground], TempCompass [liu2024tempcompass], TVBench [cores2024tvbench]) are identified as temporally sensitive and better suited for evaluating temporal perception. S: Short, V: Video, T: Text.
  • Figure 3: Ablation results of ArrowRL. We report average performance across all columns of AoTBench tasks in Table \ref{tab:result}. ArrowRL greatly outperforms the SFT baseline trained on the same data, demonstrating the effectiveness of our RL approach. In addition, using our curated high-temporality post-training data provides a performance gain, validating our data selection strategy.
  • Figure 4: Illustrative low vs. high TDS VQA examples, sourced from PerceptionTest [patraucean2023perception], NExT-QA [xiao2021nextqa] and TempCompass [liu2024tempcompass]. Samples with high TDS necessitate AoT reasoning, whereas the low-TDS sample can be solved without understanding video temporal progression.
  • ...and 9 more figures
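
The captions for Figures 2 and 3 compare accuracy on forward vs. reversed videos and report a relative accuracy drop as mean ± std. The sketch below shows one plausible way to compute such a quantity. It assumes the drop is defined as (forward − reverse) / forward and that mean and std are taken over several models. Neither assumption is confirmed by this summary, the function names are hypothetical, and the numbers are illustrative rather than results from the paper. This is not the TDS, whose definition is not given here.

```python
from statistics import mean, stdev


def relative_accuracy_drop(acc_forward: float, acc_reverse: float) -> float:
    """Fraction of forward-video accuracy lost when the video is reversed.

    Assumed definition: (forward - reverse) / forward.
    """
    if acc_forward <= 0:
        raise ValueError("forward accuracy must be positive")
    return (acc_forward - acc_reverse) / acc_forward


def summarize_drops(pairs: list[tuple[float, float]]) -> tuple[float, float]:
    """Mean and sample std of the relative drop over (forward, reverse) pairs."""
    drops = [relative_accuracy_drop(f, r) for f, r in pairs]
    spread = stdev(drops) if len(drops) > 1 else 0.0
    return mean(drops), spread


if __name__ == "__main__":
    # Illustrative (forward, reverse) accuracies for three hypothetical models.
    per_model = [(0.80, 0.40), (0.60, 0.45), (0.50, 0.50)]
    avg, spread = summarize_drops(per_model)
    print(f"relative drop: {avg:.3f} +/- {spread:.3f}")
```

A drop near zero, as in the last hypothetical pair, would indicate that reversing the video barely changes the answers, which is the low temporal sensitivity that Figure 2 highlights.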