Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge

Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Yang Liu, Zilong Zheng

TL;DR

This work addresses two bottlenecks of long-form video-language understanding, inefficient temporal grounding and a limited context window, by introducing the Temporal Grounding Bridge (TGB), a grounding backbone that operates on low-dimensional temporal features projected from flow. It combines three innovations: (i) an efficient multi-span temporal grounding mechanism on low-dimensional temporal features, (ii) a length-extrapolation training paradigm with extrapolative position encoding to extend the context window, and (iii) a bootstrapping framework that connects TGB to pluggable MLLMs without requiring temporal grounding annotations. Across seven benchmarks and multiple MLLMs, TGB improves long-form video QA and temporal grounding while remaining computationally efficient, and it generalizes zero-shot to sequences longer than those seen in training. The approach is general across MLLMs and points toward practical video chat agents and scalable video-language systems, though the authors acknowledge remaining gaps in fine-grained temporal grounding.

Abstract

Despite progress in multimodal large language models (MLLMs), the challenge of interpreting long-form videos in response to linguistic queries persists, largely due to inefficient temporal grounding and the limited size of the pre-trained context window. In this work, we introduce Temporal Grounding Bridge (TGB), a novel framework that bootstraps MLLMs with advanced temporal grounding capabilities and broadens their contextual scope. Our framework significantly enhances the temporal capabilities of current MLLMs through three key innovations: an efficient multi-span temporal grounding algorithm applied to low-dimensional temporal features projected from flow; a multimodal length extrapolation training paradigm that uses low-dimensional temporal features to extend the training context window size; and a bootstrapping framework that bridges our model with pluggable MLLMs without requiring annotation. We validate TGB across seven video benchmarks and demonstrate substantial performance improvements compared with prior MLLMs. Notably, our model, initially trained on sequences of four frames, effectively handles sequences up to 16× longer without sacrificing performance, highlighting its scalability and effectiveness in real-world applications. Our code is publicly available at https://github.com/bigai-nlco/VideoTGB
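
To make the term "multi-span" concrete, the following minimal sketch shows what a multi-span grounding output can look like: several disjoint frame ranges rather than a single start/end pair. It is an illustration only and not the algorithm used in TGB. The function name `extract_spans`, the per-frame relevance scores, and the threshold of 0.5 are all hypothetical assumptions introduced here.

```python
from typing import List, Tuple


def extract_spans(scores: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group frames whose relevance score meets the threshold into spans.

    Returns inclusive (start, end) frame-index pairs. The scores and the
    threshold are illustrative assumptions, not values from the paper.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index where the currently open span began, if any
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # open a new span
        elif start is not None:
            spans.append((start, i - 1))  # close the open span
            start = None
    if start is not None:
        spans.append((start, len(scores) - 1))  # span runs to the last frame
    return spans


if __name__ == "__main__":
    # Hypothetical relevance scores for a six-frame clip.
    print(extract_spans([0.1, 0.8, 0.9, 0.2, 0.7, 0.1]))
```

A single-span grounder would have to return either one of the two relevant regions or one range covering both together with the irrelevant frame between them. Returning a list of spans avoids that trade-off.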

Paper Structure (45 sections, 3 equations, 9 figures, 10 tables, 1 algorithm)

Figures (9)

  • Figure 1: Training efficiency and length extrapolation of TGB. (A) Performance versus trainable parameters on AGQA (Grunde-McLaughlin et al., 2021): our method achieves the best performance with fewer trainable parameters. (B) Frame extrapolation on EgoSchema under the zero-shot setting. T-$num$ denotes the training context window size in frames. Trained on four-frame videos, our model maintains consistent performance on longer videos.
  • Figure 2: Overview of the TGB framework (BLIP-based). The Temporal Grounding Bridge ($\S$\ref{subsec:tps}) is designed to capture temporal priors as well as the specific moments in a video that are grounded by language. We further develop a pluggable bootstrapping framework ($\S$\ref{subsec:joint}) that incorporates TGB-MLLM alignment, using a joint optimization strategy.
  • Figure 3: Comparison of multi-span RC prediction (d) and other methods (a-c) in terms of time and space complexity.
  • Figure 4: Inference time analysis
  • Figure 5: Qualitative results on temporal grounding
  • ...and 4 more figures