PAVE: Patching and Adapting Video Large Language Models

Zhuoming Liu, Yiquan Li, Khoi Duc Nguyen, Yiwu Zhong, Yin Li

TL;DR

PAVE addresses the challenge of adapting pre-trained Video LLMs to tasks that incorporate side-channel signals (e.g., audio, 3D data, or multi-view video) by introducing patch-based adapters that fuse side-channel tokens with video tokens through a temporal-aligned cross-attention mechanism. The patch, together with a LoRA augmentation, updates the video representation without altering the base model, incurring about $0.1\%$ extra parameters and FLOPs. Empirically, PAVE achieves state-of-the-art results on audio-visual QA (AVSD, AVQA, Music-AVQA) and 3D QA (ScanQA, SQA3D), while also improving enhanced and multi-view video understanding tasks, with strong generalization across different Video LLMs and model scales. The approach enables multi-task learning on patches and supports practical deployment via compact patches (e.g., ~20 MB) that can be distributed alongside large base models, advancing scalable, multi-modal reasoning in video-centric AI systems.

Abstract

Pre-trained video large language models (Video LLMs) exhibit remarkable reasoning capabilities, yet adapting these models to new tasks involving additional modalities or data types (e.g., audio or 3D information) remains challenging. In this paper, we present PAVE, a flexible framework for adapting pre-trained Video LLMs to downstream tasks with side-channel signals, such as audio, 3D cues, or multi-view videos. PAVE introduces lightweight adapters, referred to as "patches," which add a small number of parameters and operations to a base model without modifying its architecture or pre-trained weights. In doing so, PAVE can effectively adapt the pre-trained base model to support diverse downstream tasks, including audio-visual question answering, 3D reasoning, multi-view video recognition, and high frame rate video understanding. Across these tasks, PAVE significantly enhances the performance of the base model, surpassing state-of-the-art task-specific models while incurring a minor cost of ~0.1% additional FLOPs and parameters. Further, PAVE supports multi-task learning and generalizes well across different Video LLMs. Our code is available at https://github.com/dragonlzm/PAVE.

Paper Structure

This paper contains 22 sections, 3 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: (a) Overview of PAVE. PAVE presents a simple, parameter-efficient adapter to integrate videos and side-channel signals. It fuses side-channel tokens $\mathbf{z}^s$ with video tokens $\mathbf{z}^v$ and adds the result back to the original video tokens $\mathbf{z}^v$. (b) Details of PAVE's fusion function. The fusion function $g(\cdot)$ consists of a few blocks, each comprising a temporal-aligned cross-attention layer, an MLP, and layer normalization. (c) Temporal-aligned cross-attention. Visual tokens $\mathbf{z}^v$ and side-channel tokens $\mathbf{z}^s$ are aligned along the temporal axis. A video token $\mathbf{z}^v(k)$ is treated as the query and attends only to keys and values (defined by side-channel tokens) in its temporal neighborhood.
  • Figure 2: Visualization of sample results. We visualize and compare the results from our base model LLaVA-OneVision (under zero-shot inference) and PAVE across 3D QA and audio-visual QA tasks. Both successful and failure cases are shown.
  • Figure 3: Visualization of cross-attention scores in PAVE when injecting high frame rate videos as the side-channel. Cross-attention scores are calculated between the selected video tokens from the original low frame rate video (red cells on the left) and side-channel tokens from the high frame rate video. Scores are displayed as heatmaps over densely sampled video frames (on the right).
  • Figure 4: Visualization of QA results on the enhanced video QA task. By using video features from densely sampled frames, PAVE captures more detail in the video and thus improves video understanding performance.
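
The fusion described in the Figure 1 caption can be illustrated with a minimal sketch in plain Python. It shows only the two ideas stated there: each video token acts as a query that attends to side-channel tokens in its temporal neighborhood, and the fused result is added back to the original video tokens. Everything else is an assumption made for illustration: the function names, the explicit timestamps, the `window` parameter, and the use of identity query/key/value projections. The learned projections, MLP, and layer normalization that the caption mentions are omitted, so this is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_aligned_cross_attention(video_tokens, video_times,
                                     side_tokens, side_times, window):
    """Each video token (query) attends only to side-channel tokens
    (keys/values) whose timestamp lies within `window` of its own.

    Assumption: identity projections stand in for the learned
    query/key/value projections of a real attention layer.
    """
    dim = len(video_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query, t_query in zip(video_tokens, video_times):
        # Restrict keys/values to the temporal neighborhood of the query.
        neighbors = [tok for tok, t_side in zip(side_tokens, side_times)
                     if abs(t_side - t_query) <= window]
        if not neighbors:
            # No side-channel token nearby: contribute a zero update.
            fused.append([0.0] * dim)
            continue
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in neighbors]
        weights = softmax(scores)
        fused.append([sum(w * val[d] for w, val in zip(weights, neighbors))
                      for d in range(dim)])
    return fused


def patch_video_tokens(video_tokens, video_times,
                       side_tokens, side_times, window=0.5):
    """Residual update: original video tokens plus the fused result."""
    update = temporal_aligned_cross_attention(
        video_tokens, video_times, side_tokens, side_times, window)
    return [[x + u for x, u in zip(tok, upd)]
            for tok, upd in zip(video_tokens, update)]


# Toy example: two video tokens and two side-channel tokens (hypothetical data).
video = [[1.0, 0.0], [0.0, 1.0]]
video_t = [0.0, 1.0]
side = [[2.0, 0.0], [0.0, 4.0]]
side_t = [0.0, 1.0]

patched = patch_video_tokens(video, video_t, side, side_t, window=0.25)
unpatched = patch_video_tokens(video, video_t, side, [10.0, 11.0], window=0.25)
```

With `window=0.25`, each video token sees exactly one side-channel token, so the update equals that token. When no side-channel token falls inside the window, the video tokens pass through unchanged, which mirrors the residual design in which the patch only adds to the base model's video representation.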