Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement

Chengzhi Li, Heyan Huang, Ping Jian, Zhen Yang, Yaning Tian

TL;DR

The paper addresses temporal logic inconsistency in Video-LLMs by uncovering cross-modal attention discriminability as a core factor. Through an interpretability-driven analysis, it introduces Temporally Conditioned Attention Sharpening (TCAS), a training objective that sharpens attention distributions over temporal segments without adding modules. TCAS improves temporal consistency and video temporal grounding across multiple backbones and datasets, and generalizes to tasks like EOJ, supported by causal interventions and ablations. The findings highlight temporal discriminability of attention heads as a bottleneck in temporal understanding and demonstrate practical gains with broad applicability in video-language modeling.

Abstract

Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon has recently drawn the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervene on the potential factors behind the phenomenon. We find that one of the primary reasons for the inconsistency in responses lies in the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS), which constructs an enhancement objective based on attention distinctions to improve the model's temporal resolution capability, thereby improving the logical consistency of its temporal understanding. Experimental results demonstrate that our method significantly enhances the temporal logic consistency of Video-LLMs. Further interpretability analyses reveal that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Additionally, our method achieves performance improvements in general video temporal grounding tasks, highlighting that temporal logic consistency is a bottleneck in temporal understanding. By enhancing consistency, our method drives significant progress in video temporal understanding.
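To make the idea of "temporal discriminability" of an attention head more concrete, the following is a minimal illustrative sketch. It is not the paper's formulation: the score (mean attention inside the event span minus mean attention outside it) and the loss (negative log of the attention mass inside the span) are assumptions chosen for simplicity, and the names `discriminability` and `sharpening_loss` are hypothetical.

```python
import math


def normalize(scores):
    """Softmax over one head's raw attention scores for the video tokens."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def discriminability(attn, start, end):
    """Hypothetical score: mean attention inside [start, end) minus mean outside.

    A head that cannot tell timestamps apart gives a value near zero.
    """
    inside = attn[start:end]
    outside = attn[:start] + attn[end:]
    if not inside or not outside:
        raise ValueError("span must be non-empty and must not cover every token")
    return sum(inside) / len(inside) - sum(outside) / len(outside)


def sharpening_loss(attn, start, end, eps=1e-8):
    """Hypothetical objective: negative log of the attention mass inside the span.

    Minimizing it pushes the head to concentrate attention on the event segment.
    """
    return -math.log(sum(attn[start:end]) + eps)


# Eight video tokens; the queried event is assumed to cover tokens 2..4.
flat = normalize([0.0] * 8)                                   # no temporal preference
peaked = normalize([0.0, 0.0, 3.0, 3.0, 3.0, 0.0, 0.0, 0.0])  # focused on the event

flat_score = discriminability(flat, 2, 5)
peaked_score = discriminability(peaked, 2, 5)
flat_loss = sharpening_loss(flat, 2, 5)
peaked_loss = sharpening_loss(peaked, 2, 5)
```

Under these assumptions, the flat head scores about zero and incurs a higher loss than the peaked head, which is the direction a sharpening objective would reward.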

Paper Structure

This paper contains 40 sections, 7 equations, 15 figures, 6 tables, 1 algorithm.

Figures (15)

  • Figure 1: Overview of our work. We first analyze the factors influencing temporal understanding logic consistency in Video-LLMs, and then propose a method that enhances temporal consistency based on the resulting conclusions.
  • Figure 2: The distribution of cross-modal attention heads in TimeChat. The x-axis represents the attention layer index, and the y-axis represents the head index.
  • Figure 3: Visualization of attention score distributions for key head $A^{14,3}$ across various samples. The start and end frames of the query event are marked in red at the corresponding token positions.
  • Figure 4: Violin plot showing the distribution of attention discriminability scores for different ranges of consistency scores. $\mu$, m, and n denote the mean, median, and number of samples of each discriminability score distribution, respectively.
  • Figure 5: Violin plot of analysis results on EOJ task.
  • ...and 10 more figures