Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding

Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, Liqiang Nie

TL;DR

The paper addresses the inefficiency of processing the large number of image tokens produced by cropping-based multimodal document understanding by introducing a parameter-free, plug-and-play Token-level Correlation-guided Compressor. It uses patch-patch correlations to estimate each sub-image's information density and CLS-patch correlations to sample the most informative tokens, enabling adaptive, per-sub-image compression while preserving essential information. Integrated with the state-of-the-art mPLUG-DocOwl1.5, the method reduces the token count substantially (about 66% on average) with comparable performance across 10 datasets. The approach improves scalability for high-resolution documents in MLLMs and lays the groundwork for end-to-end optimization and wider use in multimodal compression tasks.

Abstract

Cropping high-resolution document images into multiple sub-images is the most widely used approach for document understanding in current Multimodal Large Language Models (MLLMs). Most current document understanding methods preserve all tokens within sub-images and treat them equally. This neglects differences in their informativeness and significantly increases the number of image tokens. To make document understanding more adaptive and efficient, we propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing. First, we propose an approach for assessing pattern repetitiveness based on the correlation between patch tokens. This method identifies redundant tokens, which allows the information density of each sub-image to be determined. Second, we present a token-level sampling method that efficiently captures the most informative tokens by exploiting the correlation between the [CLS] token and patch tokens. By integrating these strategies, we develop a plug-and-play adaptive compressor module that can be seamlessly incorporated into MLLMs that use cropping techniques. This module increases processing speed during training and inference while maintaining comparable performance. We conduct experiments with the SOTA document understanding model mPLUG-DocOwl1.5 and demonstrate the method's effectiveness through extensive comparisons with other compression methods.
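
The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the cosine-similarity measure, the 0.9 threshold, the rule that maps density to a token budget, and the top-k selection are all assumptions made for illustration. The abstract states only that patch-patch correlation determines information density and that [CLS]-patch correlation guides sampling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def information_density(patches, sim_threshold=0.9):
    """Fraction of patch tokens that are not near-duplicates of an earlier token.

    Assumption: a token counts as redundant when its cosine similarity to
    any earlier token exceeds sim_threshold (a hypothetical criterion).
    """
    n = len(patches)
    if n == 0:
        return 0.0
    redundant = 0
    for i in range(n):
        if any(cosine(patches[i], patches[j]) > sim_threshold for j in range(i)):
            redundant += 1
    return (n - redundant) / n


def compress(patches, cls_scores, sim_threshold=0.9):
    """Return indices of kept tokens and the estimated density.

    Assumption: the token budget k is density * N, and the k tokens with
    the highest [CLS]-patch score are kept (hypothetical selection rule).
    """
    density = information_density(patches, sim_threshold)
    k = max(1, math.ceil(density * len(patches)))
    ranked = sorted(range(len(patches)), key=lambda i: cls_scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore the original spatial order
    return keep, density


# Toy sub-image: three identical patch tokens and one distinct token.
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
cls_scores = [0.1, 0.4, 0.3, 0.2]  # hypothetical [CLS]-patch correlations
keep, density = compress(patches, cls_scores)
print(keep, density)
```

In this toy case two of the four tokens repeat an earlier token, so the estimated density is 0.5 and two tokens are kept. Because the budget is computed per sub-image, a sparse crop (for example, mostly blank page margin) would be compressed harder than a dense one, which is the adaptive behaviour the abstract describes.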

Paper Structure

This paper contains 23 sections, 14 figures, 5 tables, 2 algorithms.

Figures (14)

  • Figure 1: Comparison between (a) the existing pipeline for cropping-based high-resolution processing methods and (b) the proposed method. The proposed method adaptively retains informative tokens, making models more efficient.
  • Figure 2: Illustration of the proposed method. (a) The overall architecture. The Token-level Correlation-guided Compressor is inserted between the vision encoder and the vision-to-text module and comprises two branches: (b) the global information mining branch and (c) the local information mining branch.
  • Figure 3: Visualization of token similarity. We select tokens corresponding to visually repetitive patches and visualize the similarity between the selected tokens and others. It can be observed that visually repetitive patches exhibit a high degree of similarity between their corresponding tokens.
  • Figure 4: Visualization of the attention maps between the [CLS] token and patch tokens across different layers of CLIP-ViT-L. (a) Original input image. (b) The attention maps from layers 1 to 12. (c) The attention maps from layers 13 to 24.
  • Figure 5: Boxplot visualization of the compression ratio achieved by different compression methods across various datasets. The median value is shown next to each box.
  • ...and 9 more figures