LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs

Yunxin Li, Xinyu Chen, Baotian Hu, Min Zhang

TL;DR

The paper tackles long-video question answering by introducing the Interactive Visual Adapter (IVA), which enables instruction-aware, fine-grained visual interactions within LLMs. It produces long-video tokens efficiently from frames with a visual encoder and a causal transformer, and integrates a lightweight, multi-layer IVA (comprising a frame selector and a spatial interactor) into the LLM. Two-stage training (WebVid pretraining followed by video-instruction tuning with LoRA) yields state-of-the-art performance on most long-video QA benchmarks and strong results on short videos, with ablations confirming the value of IVA and its interaction settings. Qualitative case studies illustrate IVA's ability to support precise frame-level recognition and reasoning in long videos. The approach offers practical improvements in long-video understanding while highlighting remaining challenges with extremely long content and data balance.

Abstract

Long video understanding is a significant and ongoing challenge at the intersection of multimedia and artificial intelligence. Employing large language models (LLMs) to comprehend video has become an emerging and promising method. However, this approach incurs high computational costs due to the large number of video tokens, suffers from reduced visual clarity as a consequence of token aggregation, and is hampered by irrelevant visual tokens when answering video-related questions. To alleviate these issues, we present an Interactive Visual Adapter (IVA) within LLMs, designed to enhance interaction with fine-grained visual elements. Specifically, we first transform long videos into temporal video tokens by leveraging a visual encoder alongside a pretrained causal transformer, then feed them into LLMs together with the video instructions. Subsequently, we integrate IVA, which contains a lightweight temporal frame selector and a spatial feature interactor, within the internal blocks of LLMs to capture instruction-aware and fine-grained visual signals. Consequently, the proposed video-LLM facilitates a comprehensive understanding of long video content through appropriate long video modeling and precise visual interactions. We conducted extensive experiments on nine video understanding benchmarks, and the results show that our interactive visual adapter significantly improves the performance of video LLMs on long video QA tasks. Ablation studies further verify the effectiveness of IVA in understanding both long and short videos.
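
The two IVA components named in the abstract, a temporal frame selector and a spatial feature interactor, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`select_frames`, `spatial_interact`), the dot-product scoring, the top-k selection, and the single-query attention without learned projections are all assumptions made for illustration. The sketch only shows the general idea of first picking the frames most relevant to an instruction-derived query and then attending over the patch features of those frames.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def select_frames(query, frame_feats, k):
    """Hypothetical frame selector: keep the k frames whose global feature
    scores highest against the query, returned in temporal order."""
    ranked = sorted(range(len(frame_feats)),
                    key=lambda i: dot(query, frame_feats[i]),
                    reverse=True)
    return sorted(ranked[:k])

def spatial_interact(query, patches):
    """Hypothetical spatial interactor: one query vector cross-attends over
    patch features (scaled dot-product attention, no learned weights)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, p) / scale for p in patches])
    return [sum(w * p[i] for w, p in zip(weights, patches))
            for i in range(len(query))]

# Toy data (assumed): 4 frames, feature size 2, 2 patches per frame.
query = [1.0, 0.0]  # stands in for an instruction-aware hidden state
frame_feats = [[0.1, 0.9], [0.8, 0.2], [0.9, 0.1], [0.0, 1.0]]
patch_feats = [
    [[0.0, 1.0], [0.2, 0.8]],
    [[1.0, 0.0], [0.6, 0.4]],
    [[0.9, 0.3], [0.7, 0.1]],
    [[0.1, 1.0], [0.0, 0.9]],
]

selected = select_frames(query, frame_feats, k=2)
selected_patches = [p for i in selected for p in patch_feats[i]]
fused = spatial_interact(query, selected_patches)
print(selected)  # [1, 2]
```

In this sketch the selector discards frames 0 and 3 because their global features align poorly with the query, so the interactor never sees their patches. That mirrors the abstract's motivation of avoiding irrelevant visual tokens, under the assumptions stated above.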

Paper Structure (18 sections, 5 equations, 2 figures, 6 tables)

Figures (2)

  • Figure 1: Overview of our framework, which employs LLMs to handle long videos. When producing video tokens, we combine the global features and the aggregated fine-grained features to represent each frame, allocating two tokens per frame. A causal transformer captures temporal relationships across frames, and its output is concatenated with the spatial feature sequence. The IVA is inserted between LLM blocks to incorporate fine-grained visual features based on an understanding of the long-video tokens, text instructions, and query tokens.
  • Figure 2: Five cases comparing our IVA model with the baseline. Red and green words mark inaccurate and accurate statements, respectively.
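
The token-production step described in the Figure 1 caption (two tokens per frame, followed by causal temporal modeling) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: mean pooling stands in for the unspecified aggregation of fine-grained features, a single weight-free causal self-attention pass stands in for the pretrained causal transformer, and the helper names (`mean_pool`, `frame_tokens`, `causal_attention`) are hypothetical.

```python
import math

def mean_pool(patches):
    """Average equal-length patch vectors into one vector (assumed aggregation)."""
    dim = len(patches[0])
    return [sum(p[i] for p in patches) / len(patches) for i in range(dim)]

def frame_tokens(global_feats, patch_feats):
    """Return two tokens per frame, in frame order: [global, pooled patches]."""
    tokens = []
    for g, patches in zip(global_feats, patch_feats):
        tokens.append(list(g))
        tokens.append(mean_pool(patches))
    return tokens

def causal_attention(tokens):
    """Single-head self-attention with a causal mask and no learned weights.

    Token t attends only to tokens 0..t, so later frames cannot influence
    earlier outputs. A stand-in for a causal transformer, not a replica."""
    dim = len(tokens[0])
    out = []
    for t, query in enumerate(tokens):
        scores = [sum(q * k for q, k in zip(query, tokens[s])) / math.sqrt(dim)
                  for s in range(t + 1)]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * tokens[s][i] for s, w in enumerate(weights)) / total
                    for i in range(dim)])
    return out

# Toy input (assumed): 3 frames, feature size 2, 2 patches per frame.
global_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_feats = [
    [[1.0, 1.0], [3.0, 1.0]],
    [[0.0, 2.0], [0.0, 4.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]

video_tokens = causal_attention(frame_tokens(global_feats, patch_feats))
print(len(video_tokens))  # 6 tokens for 3 frames
```

With two tokens per frame, the sequence length grows linearly in the number of frames with a small constant, which is what keeps long videos tractable for the LLM. The fine-grained detail lost to pooling here is what the IVA is described as reintroducing inside the LLM blocks.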