Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner
Yuzhang Shang, Bingxin Xu, Weitai Kang, Mu Cai, Yuheng Li, Zehao Wen, Zhen Dong, Kurt Keutzer, Yong Jae Lee, Yan Yan
TL;DR
Video-LLMs are limited by fixed encoders and short context windows, which hinders long-video understanding, and retraining them for longer inputs is costly. The paper introduces INTP-Video-LLMs, a training-free interpolation strategy that combines video token rearrangement, which works around the fixed encoder, with RoPE-based (NTK-aware) context extension that enlarges the LLM window, plus post-training KV-cache compression for memory efficiency. Empirically, INTP-Video-LLMs improve results on zero-shot VQA benchmarks and on open-ended and multiple-choice tasks without additional training, and ablations show gains with more frames up to a plateau around 64 frames. This work offers a practical plug-in pathway to extend Video-LLMs to longer videos while keeping computational and data costs low.
Abstract
Advancements in Large Language Models (LLMs) inspire various strategies for integrating video modalities. A key approach is Video-LLMs, which incorporate an optimizable interface linking sophisticated video encoders to LLMs. However, due to computation and data limitations, these Video-LLMs are typically pre-trained to process only short videos, limiting their broader application for understanding longer video content. Additionally, fine-tuning Video-LLMs to handle longer videos is cost-prohibitive. Consequently, it becomes essential to explore the interpolation of Video-LLMs under a completely training-free setting. In this paper, we first identify the primary challenges in interpolating Video-LLMs: (1) the video encoder and modality alignment projector are fixed, preventing the integration of additional frames into Video-LLMs, and (2) the LLM backbone has a limited context length, which complicates the processing of an increased number of video tokens. To address these challenges, we propose a specific INTerPolation method for Video-LLMs (INTP-Video-LLMs). We introduce an alternative video token rearrangement technique that circumvents limitations imposed by the fixed video encoder and alignment projector. Furthermore, we introduce a training-free LLM context window extension method to enable Video-LLMs to understand a correspondingly increased number of visual tokens.
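
The TL;DR describes the context window extension as RoPE-based and NTK-aware. As a rough illustration of that general idea, the sketch below applies the commonly used NTK-aware base adjustment, base' = base * s^(d / (d - 2)), to standard RoPE inverse frequencies. It is not taken from the paper: the function names, the default base of 10000, the head dimension of 128, and the scale factor of 4 are all assumptions chosen for illustration, and the paper's exact formulation may differ.

```python
# Minimal sketch of NTK-aware RoPE base scaling (illustrative assumption,
# not the paper's implementation). All names and defaults are hypothetical.

def rope_inverse_frequencies(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies: base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def ntk_scaled_base(head_dim, scale, base=10000.0):
    """NTK-aware base adjustment: base * scale^(d / (d - 2)).

    `scale` is the desired context-length multiplier (e.g. 4.0 for 4x).
    """
    return base * scale ** (head_dim / (head_dim - 2))


def ntk_inverse_frequencies(head_dim, scale, base=10000.0):
    """RoPE inverse frequencies computed with the NTK-adjusted base."""
    return rope_inverse_frequencies(head_dim, ntk_scaled_base(head_dim, scale, base))


if __name__ == "__main__":
    original = rope_inverse_frequencies(128)
    extended = ntk_inverse_frequencies(128, scale=4.0)
    # The highest frequency is left untouched, while the lowest frequency
    # is divided by the scale factor, stretching the longest wavelength.
    print(original[0], extended[0])
    print(original[-1] / extended[-1])
```

The design point this illustrates is that the adjustment needs no training: only the rotary base changes, so high-frequency dimensions that encode local order are preserved while low-frequency dimensions are interpolated to cover a longer token sequence.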
