Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models

Zhihang Liu, Chen-Wei Xie, Pandeng Li, Liming Zhao, Longxiang Tang, Yun Zheng, Chuanbin Liu, Hongtao Xie

TL;DR

This work tackles the challenge of heavy video token costs in multi-modal large language models by introducing HICom, a conditional, hybrid-level video token compression strategy. HICom injects instruction information at both local (grouped token attention with direct injection) and global (learnable tokens with coarse injection) levels to retain instruction-relevant content while reducing tokens, and adds a conditional pre-training stage with the HICom-248K dataset. The approach achieves state-of-the-art or competitive results on five video benchmarks with substantially fewer visual tokens, including a 2.43% average improvement on three multiple-choice QA tests and a 78.8% reduction in tokens versus previous SOTA methods. This work advances efficient video understanding in MLLMs and opens avenues for longer-video processing and extension to image modalities.

Abstract

Recent Multi-modal Large Language Models (MLLMs) have been challenged by the computational overhead resulting from massive video frames, which is often alleviated through compression strategies. However, visual content does not contribute equally to user instructions, so existing strategies (e.g., average pooling) inevitably lose potentially useful information. To tackle this, we propose the Hybrid-level Instruction Injection Strategy for Conditional Token Compression in MLLMs (HICom), which uses the instruction as a condition to guide compression at both the local and global levels. This encourages the compression to retain the maximum amount of user-focused information while reducing visual tokens to minimize the computational burden. Specifically, the instruction condition is injected into the grouped visual tokens at the local level and into the learnable tokens at the global level, and an attention mechanism then performs the conditional compression. With this hybrid-level compression, the instruction-relevant visual parts are highlighted while the temporal-spatial structure is preserved, making the compressed tokens easier for LLMs to understand. To further unleash the potential of HICom, we introduce a new conditional pre-training stage with our proposed dataset, HICom-248K. Experiments show that HICom achieves strong video understanding ability with fewer tokens, improving performance by 2.43% on average on three multiple-choice QA benchmarks and saving 78.8% of tokens compared with the SOTA method. The code is available at https://github.com/lntzm/HICom.
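To make the two injection levels concrete, here is a minimal sketch of instruction-conditioned token compression in plain Python. It is not the authors' implementation. The specifics are assumptions made for illustration: a single pooled instruction vector, injection by simple addition to the query, single-head attention without learned projections, a flat 1-D token sequence instead of spatio-temporal groups, and all names (`compress`, `attend`, `group_size`, `global_queries`). The released code at the repository above is the authoritative reference.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, tokens):
    """Single-head scaled dot-product attention of one query over `tokens`."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens))
            for i in range(len(query))]


def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def mean(tokens):
    """Element-wise mean of a list of vectors."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def compress(visual, instruction, group_size, global_queries):
    """Return local tokens (one per group) followed by global tokens."""
    # Local level (assumed form): each group is summarized by attention whose
    # query is the group's mean token plus the instruction vector.
    local = []
    for start in range(0, len(visual), group_size):
        group = visual[start:start + group_size]
        local.append(attend(add(mean(group), instruction), group))
    # Global level (assumed form): a fixed set of query tokens, each shifted
    # by the instruction vector, attends over all visual tokens.
    global_tokens = [attend(add(q, instruction), visual) for q in global_queries]
    return local + global_tokens


random.seed(0)
dim = 8
visual = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(64)]
instruction = [random.gauss(0, 1) for _ in range(dim)]
# Stand-ins for learnable tokens; here they are just random vectors.
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]

compressed = compress(visual, instruction, group_size=8, global_queries=queries)
print(len(visual), "->", len(compressed))  # prints: 64 -> 12
```

In this toy setting, 64 visual tokens become 8 local tokens (one per group of 8) plus 4 global tokens. Because the local tokens keep their group order, the sequence layout survives compression, which mirrors the abstract's point about preserving temporal-spatial structure. The token counts here are arbitrary and do not correspond to the ratios reported in the paper.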

Paper Structure

This paper contains 19 sections, 4 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: An example of the video understanding task, and the comparison between the unconditional compression and our proposed conditional compression with hybrid-level instruction injection. We inject instruction at both local and global levels, guiding the compression to retain the maximum amount of user-focused information and minimize the computational burden.
  • Figure 2: The framework of our proposed HICom. We propose hybrid-level instruction injection to conditionally compress video tokens in MLLMs. At the local level, we extract instruction-relevant information within each grouped sub-region; at the global level, we condense it into a fixed number of tokens. The instruction condition is injected into the attention process to guide the compression.
  • Figure 3: We introduce a new guidance pre-training stage and implement three-stage training for conditional compression.
  • Figure 4: The visualization of data source (left) and video length (right) of our constructed HICom-248K dataset.
  • Figure 5: The ablation study on different compression ratios. The figures show the performance on VideoMME-Short (upper left), MVBench (upper right), EgoSchema (lower left), and the inference time of the 7B LLM (lower right).
  • ...and 4 more figures