Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency

Jiafeng Liang, Shixin Jiang, Xuan Dong, Ning Wang, Zheng Chu, Hui Su, Jinlan Fu, Ming Liu, See-Kiong Ng, Bing Qin

TL;DR

This work reveals that large multimodal models often fail to robustly reason about video temporality, instead leaning on priors or textual cues when video content clashes with prompts. It introduces TemRobBench, a temporal robustness benchmark with intrinsic and extrinsic perturbations across visual and textual modalities, and evaluates 16 mainstream LMMs, uncovering widespread shortcut behavior. To counter this, the authors propose PanoDPO, a panoramic direct preference optimization framework that adds video- and question-conditioned preference learning to standard DPO, guiding models to attend to both visual and linguistic signals. Empirical results show that PanoDPO significantly improves temporal robustness (e.g., true accuracy and flip-rate metrics) while preserving general video understanding, offering a pathway toward more reliable multimodal temporal analysis.

Abstract

Large Multimodal Models (LMMs) have recently demonstrated impressive performance on general video comprehension benchmarks. Nevertheless, for broader applications, the robustness of their temporal analysis capability needs to be thoroughly investigated, yet it has been largely ignored. Motivated by this, we propose a novel temporal robustness benchmark (TemRobBench), which introduces temporal inconsistency perturbations separately in the visual and textual modalities to assess the robustness of models. We evaluate 16 mainstream LMMs and find that they over-rely on prior knowledge and textual context in adversarial environments, while ignoring the actual temporal dynamics in the video. To mitigate this issue, we design panoramic direct preference optimization (PanoDPO), which encourages LMMs to incorporate both visual and linguistic feature preferences simultaneously. Experimental results show that PanoDPO can effectively enhance the model's robustness and reliability in temporal analysis.
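
For readers unfamiliar with the objective that PanoDPO builds on, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the PanoDPO objective: the video- and question-conditioned preference terms are not specified in this summary and are omitted. The function name, argument names, and the default beta value are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    All names and the default beta are hypothetical, for illustration only.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference model does, scaled by beta.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Loss is -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In this sketch, the loss falls below log 2 when the policy favors the chosen response more strongly than the reference model does, and rises above log 2 when it favors the rejected one. In the setting described above, the rejected response would plausibly be a shortcut answer, though that pairing is an assumption here.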

Paper Structure

This paper contains 27 sections, 12 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: An example of the Intrinsic Temporal Shortcut (b) and the Extrinsic Temporal Shortcut (c). The model tends to rely excessively on prior knowledge or textual context when temporal inconsistencies arise between the video content and common sense or the text prompt.
  • Figure 2: (a) Response distribution when the model is asked questions under temporal inconsistencies. The majority of errors stem from shortcuts. (b) The model's ability to discriminate between the correct answer and the shortcut answer, measured as the difference in their log-likelihoods.
  • Figure 3: Overview of TemRobBench. The benchmark evaluates the model's robustness against temporal inconsistency, focusing on intrinsic shortcuts (over-reliance on prior knowledge) and extrinsic shortcuts (over-reliance on textual context). We construct inconsistencies with prior knowledge and with textual context by shuffling video clips and event descriptions, and design corresponding shortcut answers to verify which evidence the response relies on.
  • Figure 4: Comprehensive statistics from different perspectives (left) and detailed inconsistency perturbation classes (right) in the TemRobBench.
  • Figure 5: Accuracy (left) and flip rate (right) for each perturbation class, split by aspect (blue bars represent intrinsic temporal shortcuts and red bars represent extrinsic temporal shortcuts). The striped pattern denotes the performance drop caused by the temporal inconsistencies.
  • ...and 3 more figures