Efficient Large Multi-modal Models via Visual Context Compression

Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille

TL;DR

The paper identifies substantial redundancy in visual tokens within multi-modal LLMs and proposes Visual Context Compressor, a simple average-pooling module, to reduce token counts. It pairs this with LLaVolta, a staged training paradigm that gradually relaxes compression to preserve information, achieving efficiency gains without sacrificing accuracy. Across 13 image-language and video-language benchmarks, the approach reduces training time by ~16% and inference latency by ~24%, with last-stage compression delivering the best average performance. This work pioneers compression-based acceleration for MLLMs and demonstrates practical gains in both training and inference, including extension to video-language tasks.

Abstract

While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in multi-modal LLMs (MLLMs) has remained a largely overlooked area. In this work, we present a study of visual-token redundancy and efficient training within these models. Our initial experiments show that eliminating up to 70% of visual tokens at test time by simple average pooling leads to only a minimal 3% reduction in visual question answering accuracy on the GQA benchmark, indicating significant redundancy in visual context. To address this, we introduce Visual Context Compressor, which reduces the number of visual tokens to enhance training and inference efficiency without sacrificing performance. To minimize the information loss caused by compressing visual tokens while maintaining training efficiency, we develop LLaVolta, a lightweight staged training scheme that applies stage-wise visual context compression, progressing from heavy to light compression during training and yielding no loss of information when testing. Extensive experiments demonstrate that our approach enhances the performance of MLLMs in both image-language and video-language understanding, while also significantly cutting training costs and improving inference efficiency.
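The average-pooling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the NumPy representation, the span indices, and the toy sizes (576 visual tokens, hidden size 8, stride 4) are all assumptions chosen for illustration.

```python
import numpy as np


def average_pool_tokens(tokens: np.ndarray, stride: int) -> np.ndarray:
    """Average-pool a (num_tokens, dim) array along the token axis.

    Each non-overlapping window of `stride` tokens is replaced by its mean.
    A trailing partial window is averaged over the tokens it contains.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    num_tokens = tokens.shape[0]
    pooled = [
        tokens[start:start + stride].mean(axis=0)
        for start in range(0, num_tokens, stride)
    ]
    return np.stack(pooled)


def compress_visual_context(hidden, visual_start, visual_end, stride):
    """Pool only the visual span [visual_start, visual_end) of a sequence.

    Text tokens before and after the span are passed through unchanged.
    (Hypothetical helper; the span indices are assumed to be known.)
    """
    before = hidden[:visual_start]
    visual = average_pool_tokens(hidden[visual_start:visual_end], stride)
    after = hidden[visual_end:]
    return np.concatenate([before, visual, after], axis=0)


# Toy sequence: 4 text tokens, 576 visual tokens, 6 text tokens, hidden size 8.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4 + 576 + 6, 8))
compressed = compress_visual_context(
    hidden, visual_start=4, visual_end=580, stride=4
)
print(compressed.shape)  # (154, 8): 4 + 576 / 4 + 6 tokens remain
```

With a stride of 4, three quarters of the visual tokens are removed while every text token is kept, which is roughly the regime the abstract describes as costing little accuracy on GQA.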

Paper Structure (16 sections, 3 equations, 3 figures, 10 tables)

This paper contains 16 sections, 3 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Visual tokens are redundant in MLLMs. Left: The accuracy of the LLaVA-1.5-7B [liu2024visual] model (without retraining) on the GQA [hudson2019gqa] benchmark varies with the percentage of retained visual tokens. The $x$-axis represents the percentage of original visual tokens preserved after applying 1D average pooling with varying stride sizes $S$ in the $i$-th Transformer layer. Right: Visual tokens receive less attention from the [ANS] token in the deeper layers of the LLaVA-1.5-7B model. These findings collectively suggest significant redundancy within the visual tokens of MLLMs.
  • Figure 2: Example of Visual Context Compressor in a multi-modal LLM.
  • Figure 3: Training and inference paradigm comparison for the conventional setting (A) and LLaVolta (B). The meta framework of LLaVolta consists of three training stages: Stage I with heavy visual compression; Stage II with light visual compression in a deeper layer; Stage III with subtle compression over a wider token window, without loss of performance. This can accelerate training and inference by 18+% while maintaining performance.
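
The captions for Figures 1 and 3 describe pooling with stride $S$ applied at the $i$-th Transformer layer. The sketch below estimates how much visual-token work remains under such a setting. It is a rough proxy under stated assumptions, not the paper's cost model: it counts token-layer pairs only and ignores the quadratic attention term. The function name, the 32-layer depth, the 576-token count, and the example (layer, stride) pairs are all hypothetical and are not taken from the source.

```python
from math import ceil


def visual_token_layer_fraction(num_layers: int, num_tokens: int,
                                pool_layer: int, stride: int) -> float:
    """Fraction of visual token-layer pairs left after compression.

    Layers 0 .. pool_layer-1 see all `num_tokens` visual tokens; layers
    pool_layer .. num_layers-1 see ceil(num_tokens / stride) tokens.
    """
    if not 0 <= pool_layer <= num_layers:
        raise ValueError("pool_layer must lie in [0, num_layers]")
    if stride < 1:
        raise ValueError("stride must be >= 1")
    kept = ceil(num_tokens / stride)
    compressed_cost = pool_layer * num_tokens + (num_layers - pool_layer) * kept
    full_cost = num_layers * num_tokens
    return compressed_cost / full_cost


# Hypothetical settings: (pool_layer, stride) pairs chosen for illustration.
for pool_layer, stride in [(2, 8), (16, 4), (16, 1)]:
    frac = visual_token_layer_fraction(32, 576, pool_layer, stride)
    print(f"layer={pool_layer:2d} stride={stride}: {frac:.3f} of visual token-layers")
```

Under this proxy, pooling earlier and with a larger stride removes more work, which matches the caption's ordering from heavy compression in the first stage to lighter compression in deeper layers later on. A stride of 1 leaves the sequence untouched.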