TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, Yali Wang, Yu Qiao, Limin Wang

TL;DR

The paper tackles long video understanding for multimodal LLMs by introducing TimeSuite, which couples efficient long-sequence processing (Token Shuffle and TAPE) with grounding-centric instruction tuning (TimePro) and a Temporal Grounded Caption task. VideoChat-T, built on VideoChat2, attains strong zero-shot temporal grounding and competitive long-video QA, and after temporal grounding fine-tuning rivals supervised expert models. The TimePro dataset and Temporal Grounded Caption task regularize learning, reducing hallucinations and improving segment-level alignment. Overall, TimeSuite provides an effective, scalable approach to unlock temporal reasoning in existing short-form MLLMs, with practical benefits for long-video QA and grounding tasks.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in short video understanding. However, understanding long-form videos remains challenging for MLLMs. This paper proposes TimeSuite, a collection of new designs to adapt existing short-form video MLLMs for long video understanding, including a simple yet efficient framework to process long video sequences, a high-quality video dataset for grounded tuning of MLLMs, and a carefully designed instruction tuning task that explicitly incorporates grounding supervision into the traditional QA format. Specifically, based on VideoChat, we propose our long-video MLLM, coined VideoChat-T, by implementing token shuffling to compress long video tokens and introducing Temporal Adaptive Position Encoding (TAPE) to enhance the temporal awareness of the visual representation. Meanwhile, we introduce TimePro, a comprehensive grounding-centric instruction tuning dataset composed of 9 tasks and 349k high-quality grounded annotations. Notably, we design a new instruction tuning task type, called Temporal Grounded Caption, to perform detailed video description together with prediction of the corresponding timestamps. This explicit temporal location prediction guides the MLLM to attend to the correct visual content when generating descriptions, and thus reduces the hallucination risk caused by the LLM. Experimental results demonstrate that TimeSuite provides a successful solution for enhancing the long video understanding capability of short-form MLLMs, achieving improvements of 5.6% and 6.8% on the EgoSchema and VideoMME benchmarks, respectively. In addition, VideoChat-T exhibits robust zero-shot temporal grounding capabilities, significantly outperforming existing state-of-the-art MLLMs. After fine-tuning, it performs on par with traditional supervised expert models.
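
The abstract says only that token shuffling compresses long video tokens. As a rough illustration of one way such a compression could work, the sketch below folds each group of adjacent tokens into a single wider token and then projects it back to the original width. This is an assumed reading and not the authors' implementation: the function names (fold_tokens, project, averaging_weight), the group size, and the averaging projection are all hypothetical, and a real model would presumably learn the projection.

```python
from typing import List

Vector = List[float]


def fold_tokens(tokens: List[Vector], merge: int) -> List[Vector]:
    """Concatenate every `merge` adjacent tokens along the channel axis.

    Hypothetical helper: turns N tokens of width D into N / merge tokens
    of width merge * D without discarding any values.
    """
    if merge <= 0 or len(tokens) % merge != 0:
        raise ValueError("token count must be a positive multiple of merge")
    folded: List[Vector] = []
    for start in range(0, len(tokens), merge):
        group = tokens[start:start + merge]
        folded.append([value for token in group for value in token])
    return folded


def project(folded: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Multiply each folded token (width merge * D) by a (merge * D) x D matrix."""
    columns = list(zip(*weight))
    return [
        [sum(v * w for v, w in zip(row, column)) for column in columns]
        for row in folded
    ]


def averaging_weight(merge: int, dim: int) -> List[Vector]:
    """Assumed stand-in for a learned projection: averages the merged tokens."""
    return [
        [1.0 / merge if col == row % dim else 0.0 for col in range(dim)]
        for row in range(merge * dim)
    ]


if __name__ == "__main__":
    # Six toy tokens of width 2, compressed 3x into two tokens of width 2.
    tokens = [[float(i), float(i)] for i in range(6)]
    compressed = project(fold_tokens(tokens, merge=3), averaging_weight(3, 2))
    print(len(tokens), "->", len(compressed), "tokens")
```

With the averaging matrix, the result equals mean pooling over each group. A learned projection would start from the same folded layout but could weight the merged tokens differently, since folding keeps every value available to the projection.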

Paper Structure

This paper contains 36 sections, 1 equation, 9 figures, 17 tables, 1 algorithm.

Figures (9)

  • Figure 1: VideoChat-T demonstrates high performance for both long-form video question answering and temporal grounding. Our TimeSuite presents a collection of new designs to enhance the long video understanding capability of MLLMs. It implicitly endows the MLLM with the ability to attend to the correct visual segments when generating answers, thus reducing hallucinations.
  • Figure 2: Overall architecture of VideoChat-T. First, long videos are segmented into clips, which are then transformed into feature embeddings by the video encoder and the time-aware Qformer. Next, all visual tokens undergo Token Shuffle to compress the overly long token sequence, and adaptive positional encodings are generated through TAPE. Finally, the long video tokens are concatenated with the user query and serve as the input to the LLM, which generates the response.
  • Figure 3: (a) The proposed temporal-centric instruction-tuning dataset, TimePro. This dataset contains approximately 349K high-quality, strongly temporally correlated samples. (b) The proposed Temporal Grounded Caption fine-tuning data paradigm, which effectively reduces the occurrence of hallucinations. We employ a 4-stage processing pipeline to ensure the quality of the generated data.
  • Figure 4: Qualitative comparison between VideoChat-T and other methods. VideoChat-T not only possesses temporal fine-grained perception capabilities but also can perform accurate long video reasoning. Green text indicates correct answers, while red text indicates inappropriate answers.
  • Figure 5: Performance of VideoChat-T with varying numbers of input frames. As the number of input frames increases, the performance of VideoChat-T shows an upward trend in both long video QA and temporal grounding tasks. Because the temporal grounding performance of VideoChat2 is too low, its curve is omitted.
  • ...and 4 more figures