InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression

Dongchen Lu, Yuyao Sun, Zilu Zhang, Leping Huang, Jianliang Zeng, Mao Shu, Huo Cao

TL;DR

InternVL-X tackles the heavy cost of visual tokens in multimodal LLMs by introducing three compression modules (PVTC, LVTC, and RVTC) that compress tokens at different stages of the pipeline. PVTC is a projector that uses dual queries for local and global cross-attention; LVTC compresses tokens in the shallow LLM layers and expands them in deeper layers through upsampling and residual connections; RVTC optimizes high-resolution slicing via area- or edge-based matching. Together they achieve state-of-the-art results on 7 public MLLM benchmarks while using 20% or fewer visual tokens. The work indicates that combining token compression at the projector, inside the LLM layers, and at the data-level slicing stage yields substantial efficiency gains with little or no loss in accuracy.
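To make the idea of area-based slicing concrete, the following is a minimal sketch and not the paper's RVTC algorithm. It assumes 448-pixel tiles, 256 visual tokens per tile, and a cap of 12 tiles; the function names (`candidate_grids`, `pick_grid_by_area`, `visual_tokens`) and the tie-breaking rule are hypothetical and chosen only for illustration.

```python
def candidate_grids(max_tiles):
    """Enumerate every (cols, rows) tile grid that uses at most max_tiles tiles."""
    return [(c, r)
            for c in range(1, max_tiles + 1)
            for r in range(1, max_tiles + 1)
            if c * r <= max_tiles]


def pick_grid_by_area(width, height, tile=448, max_tiles=12):
    """Pick the grid whose total pixel area is closest to the image area.

    Ties are broken by the grid whose aspect ratio is closest to the image's.
    The tile size and tile cap are assumed defaults, not values from the paper.
    """
    area = width * height
    aspect = width / height

    def cost(grid):
        c, r = grid
        return (abs(c * r * tile * tile - area), abs(c / r - aspect))

    return min(candidate_grids(max_tiles), key=cost)


def visual_tokens(grid, tokens_per_tile=256):
    """Number of visual tokens produced by a grid (assumed tokens per tile)."""
    c, r = grid
    return c * r * tokens_per_tile


if __name__ == "__main__":
    grid = pick_grid_by_area(896, 448)
    print(grid, visual_tokens(grid))
```

Under these assumptions a small image maps to a single tile instead of being upscaled to fill a larger grid, which is one way the visual token count can track image area.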

Abstract

Most multimodal large language models (MLLMs) treat visual tokens as "a sequence of text", integrating them with text tokens into a large language model (LLM). However, a great quantity of visual tokens significantly increases the demand for computational resources and time. In this paper, we propose InternVL-X, which outperforms the InternVL model in both performance and efficiency by incorporating three visual token compression methods. First, we propose a novel vision-language projector, PVTC. This component integrates adjacent visual embeddings to form a local query and utilizes the transformed CLS token as a global query, then performs point-to-region cross-attention through these local and global queries to more effectively convert visual features. Second, we present a layer-wise visual token compression module, LVTC, which compresses tokens in the LLM shallow layers and then expands them through upsampling and residual connections in the deeper layers. This significantly enhances the model's computational efficiency. Furthermore, we propose an efficient high-resolution slicing method, RVTC, which dynamically adjusts the number of visual tokens based on image area or length filtering. RVTC greatly enhances training efficiency with only a slight reduction in performance. By utilizing 20% or fewer visual tokens, InternVL-X achieves state-of-the-art performance on 7 public MLLM benchmarks, and improves the average metric by 2.34% across 12 tasks.
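The projector description above can be illustrated with a small sketch. This is a simplified reading and not the authors' implementation: it assumes the local query is the mean of a 2x2 window of embeddings, that each query attends only to the tokens of its own window (point-to-region), that keys double as values with no learned projections, and that the local and global outputs are summed. The names `attend` and `compress_grid` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def attend(query, keys):
    """Single-head scaled dot-product attention; keys are reused as values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * key[d] for w, key in zip(weights, keys))
            for d in range(len(query))]


def compress_grid(grid, cls_token, window=2):
    """Compress an H x W grid of embeddings to (H/window) * (W/window) tokens.

    H and W must be divisible by window. Summing the local and global
    attention outputs is an assumption made for this sketch.
    """
    out = []
    for i in range(0, len(grid), window):
        for j in range(0, len(grid[0]), window):
            region = [grid[i + di][j + dj]
                      for di in range(window) for dj in range(window)]
            dim = len(region[0])
            # Local query: mean of the adjacent embeddings in the window.
            local_q = [sum(t[d] for t in region) / len(region) for d in range(dim)]
            local_out = attend(local_q, region)
            # Global query: the CLS token attends to the same region.
            global_out = attend(cls_token, region)
            out.append([a + b for a, b in zip(local_out, global_out)])
    return out
```

With a 2x2 window, a 4x4 grid of 16 embeddings is reduced to 4 tokens, a 4x reduction in the number of visual tokens passed to the LLM.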

Paper Structure

This paper contains 15 sections, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison between InternVL-X and other models. # indicates the model is not an official checkpoint, but a version we retrained.
  • Figure 2: Architecture of the proposed InternVL-X, which incorporates three components: PVTC, LVTC, and RVTC. PVTC employs dual cross-attention on local and global queries to efficiently compress tokens. LVTC initially compresses visual tokens and subsequently expands them to improve their utilization across different LLM layers. RVTC optimizes image slicing to reduce the number of visual tokens.
  • Figure 3: Comparison of different projectors.
  • Figure 4: Visualization of average attention weights of each token in the LLM process. The horizontal axis is the key position and the vertical axis is the query position.
  • Figure 5: LVTC uses a high resolution projector and multi-projector structure to enhance visual information in LLM.
  • ...and 4 more figures
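
The compress-then-expand pattern that the LVTC captions describe can be sketched as follows. This is an illustrative toy and not the paper's module: it assumes compression is average pooling over adjacent tokens, expansion is nearest-neighbour repetition, and the residual is the uncompressed token sequence. The names `compress` and `expand` are hypothetical.

```python
def compress(tokens, factor=2):
    """Average-pool every `factor` adjacent tokens into one token.

    len(tokens) must be divisible by factor.
    """
    dim = len(tokens[0])
    return [[sum(t[d] for t in tokens[i:i + factor]) / factor for d in range(dim)]
            for i in range(0, len(tokens), factor)]


def expand(compressed, residual, factor=2):
    """Upsample by repeating each token `factor` times, then add the residual."""
    upsampled = [tok for tok in compressed for _ in range(factor)]
    return [[u + r for u, r in zip(up_tok, res_tok)]
            for up_tok, res_tok in zip(upsampled, residual)]


if __name__ == "__main__":
    tokens = [[1.0], [3.0], [5.0], [7.0]]
    short = compress(tokens)         # shallow layers see 2 tokens instead of 4
    restored = expand(short, tokens)  # deeper layers see 4 tokens again
    print(short, restored)
```

In this toy setting the shallow layers process half as many visual tokens, while the residual connection restores the fine-grained detail that pooling removed.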