On the Consistency of Video Large Language Models in Temporal Comprehension

Minjoon Jung, Junbin Xiao, Byoung-Tak Zhang, Angela Yao

TL;DR

This work addresses the instability of temporal comprehension in Video-LLMs by introducing Charades-CON and ActivityNet-CON to evaluate grounding and verification consistency via tailored probes. It reveals pervasive inconsistencies across widely used models and prompts, showing that improvements from prompting or standard instruction tuning are often unreliable. The authors propose VTune, an event-temporal verification tuning method that explicitly targets consistency and grounding, yielding substantial gains on both tasks across multiple models and datasets. The results advance trustworthy temporal understanding in Video-LLMs and provide a practical framework and data for future research in robust video-grounded reasoning.

Abstract

Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. Yet such temporal comprehension capabilities are neither well studied nor well understood. We therefore conduct a study on prediction consistency, a key indicator of the robustness and trustworthiness of temporal grounding. After the model identifies an initial moment within the video content, we apply a series of probes to check whether the model's responses align with this initial grounding, as an indicator of reliable comprehension. Our results reveal that current Video-LLMs are sensitive to variations in video content, language queries, and task settings, unveiling severe deficiencies in maintaining consistency. We further explore common prompting and instruction-tuning methods as potential solutions, but find that their improvements are often unstable. To that end, we propose event temporal verification tuning, which explicitly accounts for consistency, and demonstrate significant improvements in both grounding and consistency. Our data and code are open-sourced at https://github.com/minjoong507/Consistency-of-Video-LLM.

Paper Structure

This paper contains 24 sections, 2 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: Top: An example of inconsistent behavior of Video-LLMs, where the answer given in self-verification contradicts the initial temporal prediction. Bottom: We reveal that most Video-LLMs struggle to reliably confirm their initial moment predictions, achieving near chance-level consistency (50%).
  • Figure 2: Illustration of our consistency evaluation process. For each query-moment pair in the video, we shift the ground-truth moment to a different moment and prompt GPT-4o-mini to generate aligned, misaligned, and compositional queries. We measure consistency as an IoU for grounding probes and design a GPT-based evaluation to assess the model's response for verification probes.
  • Figure 3: Examples of the model responses for verification probes. We first ask the model to predict the timestamp of the given sentence, then query it based on its own predictions. For holistic and compositional verifications, we replace the $m$ in the questions with each model's moment prediction. The red text indicates misaligned queries or highlights inconsistent model responses.
  • Figure 4: Consistency evaluation of Video-LLMs using different prompting methods. The Standard indicates the original performance. The highest improvement is highlighted in red for Chain-of-Thought prompting and in blue for Description prompting.
  • Figure 5: Visualization of instruction tuning methods. The blue text represents content aligned with the meaning of the original content, while the red text indicates irrelevant content. These colors also apply to the corresponding responses. While IT only requires a timestamp for the given query, VTune prompts the model to recognize temporal and content changes and respond with corrections.
  • ...and 7 more figures
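
The Figure 2 caption states that consistency for grounding probes is measured as an IoU. The sketch below illustrates temporal IoU between an initial moment prediction and the prediction returned under a probe. It is not taken from the authors' released code: the names `temporal_iou` and `is_consistent`, the `(start, end)` representation in seconds, and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import Tuple

# A moment is a (start_sec, end_sec) pair. This representation is an
# assumption for illustration, not the paper's data format.
Moment = Tuple[float, float]


def temporal_iou(pred: Moment, ref: Moment) -> float:
    """Return the intersection-over-union of two time intervals, in [0, 1]."""
    pred_start, pred_end = pred
    ref_start, ref_end = ref
    if pred_end < pred_start or ref_end < ref_start:
        raise ValueError("each moment must satisfy start <= end")
    # Overlap is zero when the intervals are disjoint.
    intersection = max(0.0, min(pred_end, ref_end) - max(pred_start, ref_start))
    union = (pred_end - pred_start) + (ref_end - ref_start) - intersection
    return intersection / union if union > 0 else 0.0


def is_consistent(initial: Moment, probed: Moment, threshold: float = 0.5) -> bool:
    """Hypothetical check: treat a probe answer as consistent when its IoU
    with the initial prediction reaches the threshold (0.5 is assumed here)."""
    return temporal_iou(initial, probed) >= threshold


if __name__ == "__main__":
    initial_prediction = (0.0, 10.0)  # moment predicted for the original query
    probe_prediction = (5.0, 15.0)    # moment predicted for a rephrased query
    print(round(temporal_iou(initial_prediction, probe_prediction), 3))
    print(is_consistent(initial_prediction, probe_prediction))
```

In this example the two moments share 5 seconds out of a 15-second union, so the IoU is 1/3 and the pair would count as inconsistent under the assumed threshold. The verification probes described in the captions are judged by a GPT-based evaluation rather than by IoU, so this sketch covers the grounding probes only.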