Efficient Multi-modal Large Language Models via Visual Token Grouping

Minbin Huang, Runhui Huang, Han Shi, Yimeng Chen, Chuanyang Zheng, Xiangguo Sun, Xin Jiang, Zhenguo Li, Hong Cheng

TL;DR

This paper tackles the computational burden of multi-modal LLMs when processing high-resolution visuals by introducing VisToG, a token grouping mechanism that leverages pre-trained vision encoders to cluster image tokens into semantic units prior to LLM processing. A grouping layer constructs VIS tokens from semantic tokens and image segments, while isolated attention prevents semantic tokens from perturbing the original image representations, enabling efficient, instruction-aware token reduction. The training protocol consists of a two-stage process that first aligns image features with the LLM and then jointly tunes the grouping layer, the lightweight visual connector, and the LLM to achieve instruction-aware grouping. Empirically, VisToG preserves 98.1% of the original performance while reducing inference time by over 27%, demonstrating a practical path to deploying high-resolution MLLMs in resource-constrained settings and highlighting the potential for extending token reduction to video data.
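The grouping step described above can be illustrated with a small sketch. This is not the paper's grouping layer: it assumes a hard assignment of each image token to its most similar semantic token (by dot product) followed by mean pooling, and the names `group_tokens`, `patches`, and `semantic` are hypothetical. It only shows how N image tokens can be collapsed into K grouped tokens before reaching the LLM.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def group_tokens(patches: List[Vector], semantic: List[Vector]) -> List[Vector]:
    """Collapse N patch tokens into K grouped tokens (K = len(semantic)).

    Assumption (not from the paper): hard argmax assignment by dot-product
    similarity, then a per-group mean of the assigned patch tokens.
    """
    dim = len(semantic[0])
    sums = [[0.0] * dim for _ in semantic]
    counts = [0] * len(semantic)

    for p in patches:
        scores = [dot(p, s) for s in semantic]
        # Index of the most similar semantic token.
        g = max(range(len(semantic)), key=scores.__getitem__)
        counts[g] += 1
        for i, v in enumerate(p):
            sums[g][i] += v

    grouped = []
    for g, s in enumerate(semantic):
        if counts[g] == 0:
            # Empty group: fall back to the semantic token itself.
            grouped.append(list(s))
        else:
            grouped.append([v / counts[g] for v in sums[g]])
    return grouped


# Four patch tokens are reduced to two grouped tokens.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
semantic = [[1.0, 0.0], [0.0, 1.0]]
grouped = group_tokens(patches, semantic)
```

The LLM then receives `len(semantic)` visual tokens instead of `len(patches)`, which is where the inference-time saving would come from. A trainable layer would need a differentiable assignment rather than a plain argmax.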

Abstract

The development of Multi-modal Large Language Models (MLLMs) enhances Large Language Models (LLMs) with the ability to perceive data formats beyond text, significantly advancing a range of downstream applications, such as visual question answering and image captioning. However, the substantial computational costs associated with processing high-resolution images and videos pose a barrier to their broader adoption. To address this challenge, compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. While existing methods conduct token reduction in the feature alignment phase, in this paper we introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments without the need for segmentation masks. Specifically, we concatenate semantic tokens, which represent semantic segments of the image, with the image patch tokens after the linear projection layer and before the vision encoder. In addition, with the isolated attention we adopt, VisToG can identify and eliminate redundant visual tokens using the prior knowledge in the pre-trained vision encoder, which effectively reduces computational demands. Extensive experiments demonstrate the effectiveness of VisToG, which maintains 98.1% of the original performance while reducing inference time by over 27%.
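One way to picture isolated attention is as a boolean attention mask over the concatenated sequence. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes patch tokens come first and semantic tokens last, that patch queries may attend only to patch keys (so the original image representations are unaffected by semantic tokens), and that semantic queries may attend to every token. The function name `isolated_attention_mask` is hypothetical.

```python
from typing import List


def isolated_attention_mask(num_patches: int, num_semantic: int) -> List[List[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumed token order: [patch_0 .. patch_{N-1}, semantic_0 .. semantic_{K-1}].
    Assumed rule: patch queries see only patch keys; semantic queries see all keys.
    """
    total = num_patches + num_semantic
    mask = []
    for q in range(total):
        q_is_patch = q < num_patches
        row = []
        for k in range(total):
            k_is_patch = k < num_patches
            # Block exactly one direction: patch query -> semantic key.
            row.append(k_is_patch or not q_is_patch)
        mask.append(row)
    return mask


# Example: 4 patch tokens followed by 2 semantic tokens.
mask = isolated_attention_mask(num_patches=4, num_semantic=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With such a mask, the patch-token outputs are the same as they would be without any semantic tokens in the sequence, while the semantic tokens can still aggregate information from the patches.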

Paper Structure

This paper contains 18 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of our proposed VisToG. Semantic tokens are concatenated with the image patch tokens after linear projection and fed into the pre-trained vision encoder. Before the visual projector to the LLM, a grouping layer groups similar image segment tokens into semantically abstracted tokens of the image. In addition, isolated attention is applied to ensure a better abstraction.
  • Figure 2: (a) Structure of the grouping layer. (b) Comparison of inference time and average performance across models.
  • Figure 3: Visualization of the image tokens selected by LLaVA-rand. The instruction is "What is the main focus of the image?". Response from LLaVA: "The main focus of the image is a cat sitting on a desk in front of a laptop computer".
  • Figure 4: Performance comparison between standard attention and isolated attention. The numbers are the performance relative to the baseline.
  • Figure 5: (a) Ablation on the number of image tokens on the POPE dataset. (b) Ablation on the number of image tokens on the GQA dataset.