LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs

Leqi Shen, Tao He, Guoqiang Gong, Fan Yang, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Guiguang Ding

TL;DR

LLaVA-MLB is a training-free video LLM that tackles the positional attention bias Image LLMs exhibit when compressing video tokens. It builds on SegAttn, a two-stage framework that first compresses and then expands the token sequence. Gridded Attention Pooling (GAPool) mitigates the bias during compression by preserving spatiotemporal structure, and Visual Summarization Tail (VSTail) leverages the same bias during expansion to capture global video context. The approach reports state-of-the-art results among training-free methods on open-ended VideoQA and multiple-choice tasks while reducing pre-filling time, with both 7B and 34B LLMs. It enables detailed video understanding with frozen Image LLMs, offering practical efficiency and accuracy benefits for edge devices and real-time applications.

Abstract

Training-free video large language models (LLMs) leverage pretrained Image LLMs to process video content without further training. A key challenge in such approaches is the difficulty of retaining essential visual and temporal information, constrained by the token limits in Image LLMs. To address this, we propose a two-stage method for selecting query-relevant tokens based on the LLM attention scores: compressing the video sequence and then expanding the sequence. However, during the compression stage, Image LLMs often exhibit a positional attention bias in video sequences, where attention is overly concentrated on later frames, causing early-frame information to be underutilized. To alleviate this attention bias during sequence compression, we propose Gridded Attention Pooling for preserving spatiotemporal structure. Additionally, we introduce Visual Summarization Tail to effectively utilize this bias, facilitating overall video understanding during sequence expansion. In this way, our method effectively Mitigates and Leverages attention Bias (LLaVA-MLB), enabling a frozen Image LLM to perform detailed video understanding. Experiments on several benchmarks demonstrate that our approach outperforms state-of-the-art methods, achieving superior performance in both efficiency and accuracy. Our code will be released.
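
To make the compress-then-expand idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes the video is split into segments, that each segment already has one attention score per visual token, that compression keeps the top-scoring tokens of each segment, and that expansion simply concatenates the survivors. The names `top_k_indices` and `compress_then_expand` and the toy scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest scores, in original sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def compress_then_expand(
    segment_scores: Sequence[Sequence[float]], keep_per_segment: int
) -> List[Tuple[int, int]]:
    """Assumed two-stage flow: compress each segment, then concatenate.

    Stage one keeps the highest-scoring tokens of every segment.
    Stage two joins the survivors into one longer sequence.
    Each returned item is (segment_index, token_index).
    """
    expanded = []
    for seg_idx, scores in enumerate(segment_scores):
        for tok_idx in top_k_indices(scores, keep_per_segment):
            expanded.append((seg_idx, tok_idx))
    return expanded


# Toy scores (made up): two segments of six tokens each. Later tokens score
# higher, mimicking the positional bias described in the abstract.
toy_scores = [
    [0.01, 0.02, 0.03, 0.10, 0.30, 0.54],
    [0.02, 0.01, 0.05, 0.12, 0.25, 0.55],
]
selected = compress_then_expand(toy_scores, keep_per_segment=2)
print(selected)
```

With these toy scores, plain top-k selection keeps only the last positions of each segment, which is the failure mode that motivates a structure-preserving alternative such as Gridded Attention Pooling.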

Paper Structure

This paper contains 18 sections, 2 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: An example from ANet-QA (Yu et al., 2019) illustrates the positional attention bias. Out of the 2880 tokens from 5 sampled frames ($5\times 24 \times 24$), the 720 tokens with the highest attention scores are mostly concentrated in the last frame.
  • Figure 2: The average attention scores on ANet-QA. For each video, tokens from 5 sampled frames ($5\times 24 \times 24$) are input into the LLM. The attention score is computed using the last query token at the $3^{\rm rd}$ layer. Brighter areas correspond to higher attention values. (a) shows five sequential frames, and (b) shows them in reverse order. In (c), a single frame is sampled and repeated five times.
  • Figure 3: Illustration of LLaVA-MLB. We employ the two-stage SegAttn framework to extend the visual sequence. GAPool is introduced to mitigate attention bias, improving token compression. VSTail leverages attention bias to enhance overall video understanding.
  • Figure 4: Ablation study on GAPool in LLaVA-MLB$^{G}$ using a 5-frame segment on ANet-QA. The red dashed line indicates the accuracy without compression. "# input tokens" denotes the number of tokens fed into stage two. In GAPool, the token counts 1440, 960, 720, 480, 320, 180, and 80 correspond to grid sizes of 1$\times$2, 1$\times$3, 2$\times$2, 2$\times$3, 3$\times$3, 4$\times$4, and 6$\times$6, respectively.
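
The token counts in the Figure 4 caption can be reproduced with simple arithmetic. The sketch below is an illustration under stated assumptions, not the paper's code: it assumes each grid size is a height $\times$ width cell applied within every $24 \times 24$ frame and that exactly one token survives per cell (both inferred from the counts in the caption), and it further assumes the survivor is the token with the highest attention score in its cell. The function names are hypothetical.

```python
from typing import List, Tuple

FRAMES, HEIGHT, WIDTH = 5, 24, 24  # from the captions: 5 x 24 x 24 = 2880 tokens


def tokens_after_pooling(cell_h: int, cell_w: int) -> int:
    """Token count if every cell_h x cell_w cell keeps exactly one token."""
    assert HEIGHT % cell_h == 0 and WIDTH % cell_w == 0
    return FRAMES * (HEIGHT // cell_h) * (WIDTH // cell_w)


def gridded_argmax(
    frame_scores: List[List[float]], cell_h: int, cell_w: int
) -> List[Tuple[int, int]]:
    """Keep the highest-scoring (row, col) position in each cell of one frame.

    Assumption: selection is a per-cell argmax over attention scores.
    """
    rows, cols = len(frame_scores), len(frame_scores[0])
    kept = []
    for top in range(0, rows, cell_h):
        for left in range(0, cols, cell_w):
            cell = [
                (r, c)
                for r in range(top, top + cell_h)
                for c in range(left, left + cell_w)
            ]
            kept.append(max(cell, key=lambda rc: frame_scores[rc[0]][rc[1]]))
    return kept


grid_sizes = [(1, 2), (1, 3), (2, 2), (2, 3), (3, 3), (4, 4), (6, 6)]
counts = [tokens_after_pooling(h, w) for h, w in grid_sizes]
print(counts)

# Toy 2 x 4 frame of made-up scores, pooled with 2 x 2 cells.
toy_frame = [
    [0.1, 0.9, 0.2, 0.3],
    [0.4, 0.0, 0.8, 0.1],
]
kept_positions = gridded_argmax(toy_frame, 2, 2)
print(kept_positions)
```

Because one token is kept from every cell, each frame and each spatial region contributes to the compressed sequence regardless of where the raw attention mass falls, unlike a global top-k over all 2880 tokens.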