LVCHAT: Facilitating Long Video Comprehension

Yu Wang; Zeyuan Zhang; Julian McAuley; Zexue He

TL;DR

LVChat tackles long-video understanding for multimodal LLMs with two techniques. Frame-Scalable Encoding (FSE) maps each video clip of $K$ frames into $N$ embeddings and forms $E_{FSE} \in \mathbb{R}^{(nN) \times d}$ with $n = \lceil T / K \rceil$, where $T$ is the number of frames and $d$ is the embedding dimension. Interleaved Frame Encoding (IFE) uses an interleaving factor $\gamma$ to handle inputs longer than those seen during training. The approach aligns video embeddings with the LLM's token space and fine-tunes the backbone on the scalable representations. Experiments on long-video QA and captioning show up to $27\%$ accuracy improvement over baselines on long videos, along with strong results on real-world datasets. The work offers a scalable blueprint for long-video comprehension in multimodal LLMs and points to longer training data and larger models as promising directions.
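The shape arithmetic behind FSE can be checked with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: `frame_scalable_encode` and `make_toy_clip_encoder` are hypothetical names, the clip encoder is a placeholder that only produces arrays of the right shape, and padding the last clip by repeating the final frame is an assumption (the summary does not say how a clip with fewer than $K$ frames is handled).

```python
import math
import numpy as np


def make_toy_clip_encoder(N, d):
    """Hypothetical stand-in for a real clip encoder: returns N embeddings of size d."""
    def encode_clip(clip):
        # Collapse the clip to one scalar and broadcast it; only the shape matters here.
        return np.full((N, d), float(clip.mean()))
    return encode_clip


def frame_scalable_encode(frames, K, encode_clip):
    """Split T frames into n = ceil(T / K) clips and concatenate per-clip embeddings.

    frames: array of shape (T, ...), one entry per frame.
    Returns an array of shape (n * N, d).
    """
    T = frames.shape[0]
    if T == 0:
        raise ValueError("need at least one frame")
    n = math.ceil(T / K)
    pad = n * K - T
    if pad > 0:
        # Assumption: pad the last clip by repeating the final frame.
        tail = np.repeat(frames[-1:], pad, axis=0)
        frames = np.concatenate([frames, tail], axis=0)
    clips = frames.reshape(n, K, *frames.shape[1:])
    per_clip = [encode_clip(clip) for clip in clips]  # each has shape (N, d)
    return np.concatenate(per_clip, axis=0)


# Example: T = 10 frames, K = 4 frames per clip, N = 4 embeddings per clip, d = 8.
frames = np.random.rand(10, 2, 2)
E_fse = frame_scalable_encode(frames, K=4, encode_clip=make_toy_clip_encoder(N=4, d=8))
print(E_fse.shape)
```

The number of embeddings grows with the video length through $n$, which is the property the TL;DR attributes to FSE; a fixed-size encoder would instead return the same number of embeddings regardless of $T$.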

Abstract

Enabling large language models (LLMs) to read videos is vital for multimodal LLMs. Existing works show promise on short videos, whereas comprehension of long videos (e.g., longer than 1 minute) remains challenging. The major problem lies in the over-compression of videos, i.e., the encoded video representations are not enough to represent the whole video. To address this issue, we propose Long Video Chat (LVChat), where Frame-Scalable Encoding (FSE) is introduced to dynamically adjust the number of embeddings in alignment with the duration of the video, ensuring that long videos are not overly compressed into a few embeddings. To deal with videos longer than those seen during training, we propose Interleaved Frame Encoding (IFE), which repeats positional embeddings and interleaves multiple groups of videos to enable long video input, avoiding the performance degradation caused by overly long videos. Experimental results show that LVChat significantly outperforms existing methods by up to 27% in accuracy on long-video QA datasets and long-video captioning benchmarks. Our code is published at https://github.com/wangyu-ustc/LVChat.
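The interleaving step of IFE can be sketched as follows, based on the abstract and the Figure 3 caption: the embeddings of $\gamma$ groups are interleaved, and every $\gamma$ consecutive embeddings share one positional index. This is a minimal sketch, not the authors' code: `interleave_groups` is a hypothetical name, and the function takes the per-group embeddings as given, since the summary does not specify how frames are assigned to groups.

```python
import numpy as np


def interleave_groups(groups):
    """Interleave gamma groups of embeddings and build shared position ids.

    groups: list of gamma arrays, each of shape (m, d).
    Returns (embeddings of shape (gamma * m, d), position_ids of shape (gamma * m,)).
    """
    gamma = len(groups)
    if gamma == 0:
        raise ValueError("need at least one group")
    m, d = groups[0].shape
    if any(g.shape != (m, d) for g in groups):
        raise ValueError("all groups must have the same shape")
    # Stack to (m, gamma, d), then flatten so the order is g0[0], g1[0], ..., g0[1], g1[1], ...
    interleaved = np.stack(groups, axis=1).reshape(m * gamma, d)
    # Every gamma consecutive embeddings reuse the same positional index.
    position_ids = np.repeat(np.arange(m), gamma)
    return interleaved, position_ids


# Example with gamma = 2 groups of m = 3 embeddings each.
group_a = np.zeros((3, 2))
group_b = np.ones((3, 2))
embeddings, position_ids = interleave_groups([group_a, group_b])
print(position_ids)  # [0 0 1 1 2 2]
```

Because the position ids only run up to $m$ rather than $\gamma m$, the model never sees a positional index beyond the range covered in training, even though the input holds $\gamma$ times as many embeddings.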

Paper Structure (36 sections, 7 equations, 6 figures, 10 tables)

Introduction
Related Work
Long Context Modeling
Video Question Answering
Enabling LLMs to Process Videos through Descriptive Textualization
Enabling LLMs to Process Videos via Adapters
Method
Preliminary
Frame-Scalable Encoding
Interleaved Frame Encoding
Experiments
Implementation Details
Experimental Setups
Overall Performance Comparison
Ablation Study of LVChat
...and 21 more sections

Figures (6)

Figure 1: Previous video language models may suffer from over-compression for long video modeling (e.g., $T>60$s ) since a limited number of video tokens are used in LMs. In contrast, LVChat demonstrates superior performance on long videos by modeling more video tokens.
Figure 2: Illustration of Frame-Scalable Encoding. The process begins by segmenting the video into several clips. Subsequently, each clip is transformed into a set of $N$ embeddings. These embeddings are then concatenated sequentially, forming a comprehensive input stream for the Large Language Model (LLM).
Figure 3: Illustration of Interleaved Frame Encoding (IFE). We show the example with interleaving factor $\gamma$ being two. We first split the whole video into $\gamma$ groups. Then we convert each part into embeddings separately. With all the embeddings, we interleave them with every $\gamma$ embeddings sharing the same positional embedding.
Figure 4: Average accuracies w.r.t different video lengths. "26" is the average duration of videos across four datasets. The IFE technique is not applied when videos are of lengths 26 and 100.
Figure 5: Accuracies w.r.t. the number of tokens
...and 1 more figure
