Temporal Reasoning Transfer from Text to Video

Lei Li, Yuanxin Liu, Linli Yao, Peiyuan Zhang, Chenxin An, Lean Wang, Xu Sun, Lingpeng Kong, Qi Liu

TL;DR

This diagnostic study reveals that video representations contain sufficient information for even small probing classifiers to achieve perfect accuracy, and introduces the Textual Temporal reasoning Transfer (T3), validating the efficacy of transferring temporal reasoning abilities from text to video domains.

Abstract

Video Large Language Models (Video LLMs) have shown promising capabilities in video comprehension, yet they struggle with tracking temporal changes and reasoning about temporal relationships. While previous research attributed this limitation to the ineffective temporal encoding of visual inputs, our diagnostic study reveals that video representations contain sufficient information for even small probing classifiers to achieve perfect accuracy. Surprisingly, we find that the key bottleneck in Video LLMs' temporal reasoning capability stems from the underlying LLM's inherent difficulty with temporal concepts, as evidenced by poor performance on textual temporal question-answering tasks. Building on this discovery, we introduce the Textual Temporal reasoning Transfer (T3). T3 synthesizes diverse temporal reasoning tasks in pure text format from existing image-text datasets, addressing the scarcity of video samples with complex temporal scenarios. Remarkably, without using any video data, T3 enhances LongVA-7B's temporal understanding, yielding a 5.3 absolute accuracy improvement on the challenging TempCompass benchmark, which enables our model to outperform ShareGPT4Video-8B trained on 28,000 video samples. Additionally, the enhanced LongVA-7B model achieves competitive performance on comprehensive video benchmarks. For example, it achieves a 49.7 accuracy on the Temporal Reasoning task of Video-MME, surpassing powerful large-scale models such as InternVL-Chat-V1.5-20B and VILA1.5-40B. Further analysis reveals a strong correlation between textual and video temporal task performance, validating the efficacy of transferring temporal reasoning abilities from text to video domains.
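
To make the idea of synthesizing temporal questions in pure text more concrete, here is a minimal sketch. It is not the authors' pipeline: the function name `make_order_question`, the "Frame k" context format, and the two-option question template are all assumptions made for illustration. It treats an ordered list of image captions as a pseudo-video and builds one "which happens first" question from it.

```python
import random


def make_order_question(captions, rng):
    """Build one textual 'event order' QA pair from ordered image captions.

    Hypothetical helper: list order stands in for temporal order, and the
    captions are assumed to be distinct strings.
    """
    if len(captions) < 2:
        raise ValueError("need at least two captions")

    # Pick two distinct positions; after sorting, `earlier` precedes `later`.
    earlier, later = sorted(rng.sample(range(len(captions)), 2))

    # Shuffle which option letter receives the earlier event to avoid position bias.
    picks = [earlier, later]
    rng.shuffle(picks)
    options = {"A": captions[picks[0]], "B": captions[picks[1]]}
    answer = "A" if picks[0] == earlier else "B"

    # The textual context lists the captions in temporal order (assumed format).
    context = "\n".join(f"Frame {k + 1}: {c}" for k, c in enumerate(captions))
    question = (
        "Which event happens first?\n"
        f"A. {options['A']}\n"
        f"B. {options['B']}"
    )
    return {"context": context, "question": question,
            "options": options, "answer": answer}


if __name__ == "__main__":
    caps = [
        "a man picks up a cup",
        "a man drinks from the cup",
        "a man puts the cup down",
    ]
    qa = make_order_question(caps, random.Random(0))
    print(qa["context"])
    print(qa["question"])
    print("Answer:", qa["answer"])
```

Passing an explicit `random.Random` instance keeps the sampling reproducible. Other temporal aspects would need their own templates, which this sketch does not attempt.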

Paper Structure (31 sections, 9 figures, 20 tables, 5 algorithms)

Figures (9)

  • Figure 1: Two popular Video LLMs struggle with basic temporal reasoning (left). We mitigate this issue via textual temporal transfer (middle), which demonstrates consistent improvement (right).
  • Figure 2: Example of videos and questions that focus on different temporal reasoning abilities.
  • Figure 3: Temporal probing results for LongVA (upper) and VILA (lower). The probes on visual representations achieve $>90$ accuracy in most cases, while the LLM decoders still have substantial room for improvement even with textual inputs, which leads to the poor temporal understanding ability of Video LLMs. Detailed results for the sub-categories are reported in Appendix \ref{app:probing_results}.
  • Figure 4: Temporal-oriented question-answering pairs with textual image captions as context.
  • Figure 5: Textual temporal reasoning accuracy correlates positively with video understanding results on three benchmarks.
  • ...and 4 more figures