PAVE: Patching and Adapting Video Large Language Models
Zhuoming Liu, Yiquan Li, Khoi Duc Nguyen, Yiwu Zhong, Yin Li
TL;DR
PAVE adapts pre-trained Video LLMs to tasks that involve side-channel signals (e.g., audio, 3D data, or multi-view video) by introducing lightweight adapters, called patches, that fuse side-channel tokens with video tokens through a temporally aligned cross-attention mechanism. A patch, together with LoRA, updates the video representation without altering the base model's architecture or pre-trained weights, adding about 0.1% extra parameters and FLOPs. Empirically, PAVE achieves state-of-the-art results on audio-visual QA (AVSD, AVQA, Music-AVQA) and 3D QA (ScanQA, SQA3D), and it also improves high-frame-rate and multi-view video understanding, generalizing well across different Video LLMs and model scales. PAVE further supports multi-task learning, and its patches are compact (e.g., ~20 MB), so they can be distributed alongside large base models.
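To make the fusion step described above more concrete, the following is a minimal NumPy sketch of cross-attention in which video tokens act as queries and side-channel tokens act as keys and values, restricted to a temporal window. It is an illustration under stated assumptions, not the authors' implementation: the function names (`temporal_mask`, `patch_fuse`), the window-based alignment rule, the single attention head, and the residual update are all hypothetical choices made here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_mask(t_video, t_side, window):
    # mask[i, j] is True when side-channel token j lies within `window`
    # seconds of video token i (an assumed alignment rule).
    return np.abs(t_video[:, None] - t_side[None, :]) <= window

def patch_fuse(video, side, t_video, t_side, Wq, Wk, Wv, Wo, window=1.0):
    """Single-head, temporally masked cross-attention with a residual update.

    video: (T_v, d_v) video tokens from the frozen base model (queries)
    side:  (T_s, d_s) side-channel tokens, e.g. audio (keys and values)
    Only Wq, Wk, Wv, Wo belong to the patch; the base model is untouched.
    """
    q = video @ Wq                                 # (T_v, d)
    k = side @ Wk                                  # (T_s, d)
    v = side @ Wv                                  # (T_s, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (T_v, T_s)
    mask = temporal_mask(t_video, t_side, window)
    scores = np.where(mask, scores, -1e9)          # suppress out-of-window pairs
    # Multiplying by the mask zeroes rows that have no in-window side token.
    attn = softmax(scores, axis=-1) * mask
    # Residual update: the original video tokens pass through unchanged
    # wherever the patch contributes nothing.
    return video + (attn @ v) @ Wo

# Toy example with made-up sizes and timestamps.
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 8))
side = rng.normal(size=(6, 5))
t_video = np.array([0.0, 1.0, 2.0, 10.0])
t_side = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
Wq, Wk, Wv = (rng.normal(size=s) for s in [(8, 8), (5, 8), (5, 8)])
Wo = np.zeros((8, 8))  # zero-initialised output projection (an assumption)
fused = patch_fuse(video, side, t_video, t_side, Wq, Wk, Wv, Wo)
```

With the output projection initialised to zero, `fused` equals `video`, so in this sketch an untrained patch leaves the base model's behaviour unchanged. This is a common adapter design choice rather than something stated in the text.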
Abstract
Pre-trained video large language models (Video LLMs) exhibit remarkable reasoning capabilities, yet adapting these models to new tasks involving additional modalities or data types (e.g., audio or 3D information) remains challenging. In this paper, we present PAVE, a flexible framework for adapting pre-trained Video LLMs to downstream tasks with side-channel signals, such as audio, 3D cues, or multi-view videos. PAVE introduces lightweight adapters, referred to as "patches," which add a small number of parameters and operations to a base model without modifying its architecture or pre-trained weights. In doing so, PAVE can effectively adapt the pre-trained base model to support diverse downstream tasks, including audio-visual question answering, 3D reasoning, multi-view video recognition, and high frame rate video understanding. Across these tasks, PAVE significantly enhances the performance of the base model, surpassing state-of-the-art task-specific models while incurring a minor cost of ~0.1% additional FLOPs and parameters. Further, PAVE supports multi-task learning and generalizes well across different Video LLMs. Our code is available at https://github.com/dragonlzm/PAVE.
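As a rough sanity check on the scale of the quoted overhead, the back-of-the-envelope calculation below converts a patch file size into a parameter count and compares it with a base model. The 7-billion-parameter base model and the 2-bytes-per-parameter (16-bit) storage are assumptions made purely for illustration; neither figure is stated in the text, and the helper names are hypothetical.

```python
def patch_param_count(size_mb, bytes_per_param=2):
    # Parameters that fit in `size_mb` megabytes (1 MB taken as 10**6 bytes),
    # assuming every parameter is stored in `bytes_per_param` bytes.
    return size_mb * 1e6 / bytes_per_param

def overhead_percent(patch_params, base_params):
    # Patch parameters as a percentage of base-model parameters.
    return 100.0 * patch_params / base_params

patch_params = patch_param_count(20)           # ~20 MB patch, 16-bit weights
pct = overhead_percent(patch_params, 7e9)      # hypothetical 7B base model
print(f"{patch_params:.0f} parameters, {pct:.2f}% of the base model")
```

Under these assumptions a 20 MB patch holds about ten million parameters, which is on the order of 0.1% of the assumed base model. Different storage precisions or base-model sizes would shift the figure.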
