DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming

Jiaxin Zhang, Wentao Yang, Songxuan Lai, Zecheng Xie, Lianwen Jin

TL;DR

DocKylin tackles visual document understanding on high-resolution inputs by slimming visual content at both the pixel and token levels. Adaptive Pixel Slimming (APS) removes redundant pixels based on gradient information, while Dynamic Token Slimming (DTS) clusters visual tokens and aggregates nonessential ones into essential ones. The model pairs a lightweight Swin-based encoder with a Qwen-7B-Chat LLM and accepts a flexible input resolution of up to 1728×1728. Experiments show that DocKylin achieves state-of-the-art results on multiple VDU benchmarks and that APS and DTS shorten the visual sequence and reduce training time, with similarity-weighted aggregation in DTS offering additional gains. The approach is efficient and modular, and can be readily integrated into existing MLLMs to improve real-world document analysis.

Abstract

Current multimodal large language models (MLLMs) face significant challenges in visual document understanding (VDU) tasks due to the high resolution, dense text, and complex layouts typical of document images. These characteristics demand a high level of detail perception ability from MLLMs. While increasing input resolution improves detail perception capability, it also leads to longer sequences of visual tokens, increasing computational costs and straining the models' ability to handle long contexts. To address these challenges, we introduce DocKylin, a document-centric MLLM that performs visual content slimming at both the pixel and token levels, thereby reducing token sequence length in VDU scenarios. We introduce an Adaptive Pixel Slimming (APS) preprocessing module to perform pixel-level slimming, increasing the proportion of informative pixels. Moreover, we propose a novel Dynamic Token Slimming (DTS) module to conduct token-level slimming, filtering essential tokens and removing others to adaptively create a more compact visual sequence. Experiments demonstrate DocKylin's promising performance across various VDU benchmarks and the effectiveness of each component.
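
To make the pixel-level idea concrete, the following is a minimal sketch of gradient-based slimming, not the authors' implementation. It assumes a grayscale input, finite-difference gradients via `numpy.gradient`, and a hypothetical `threshold` parameter; rows and columns in which no pixel exceeds the threshold are treated as redundant and dropped. The function name and the threshold value are illustrative assumptions.

```python
import numpy as np


def adaptive_pixel_slimming_sketch(gray, threshold=10.0):
    """Drop rows and columns that contain no high-gradient pixel.

    gray: 2-D array (H, W) of grayscale intensities.
    threshold: hypothetical cut-off on gradient magnitude (assumption).
    Returns the slimmed image and the boolean keep-masks for rows and columns.
    """
    gray = np.asarray(gray, dtype=np.float32)
    # Finite-difference gradients along rows (axis 0) and columns (axis 1).
    grad_y, grad_x = np.gradient(gray)
    magnitude = np.hypot(grad_x, grad_y)
    informative = magnitude > threshold
    # A row or column is kept if at least one of its pixels is informative.
    keep_rows = informative.any(axis=1)
    keep_cols = informative.any(axis=0)
    slimmed = gray[np.ix_(keep_rows, keep_cols)]
    return slimmed, keep_rows, keep_cols
```

On a mostly blank page with a single dark block, this keeps only the rows and columns that touch the block's edges or interior, so the slimmed image has a higher proportion of informative pixels than the input.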

Paper Structure (21 sections, 5 equations, 8 figures, 8 tables, 3 algorithms)

Figures (8)

  • Figure 1: The overall architecture of our DocKylin model.
  • Figure 2: The proposed Adaptive Pixel Slimming module. It effectively reduces the resolution of document images by removing redundant regions.
  • Figure 3: The proposed Similarity Weighted Aggregation module. It aggregates nonessential tokens into essential ones through similarity-weighted summation.
  • Figure 4: Visualization from DocKylin. The redundant pixels and nonessential tokens identified by Adaptive Pixel Slimming and Dynamic Token Slimming are highlighted in light blue. Zoom in for best view.
  • Figure 5: The results of APS-mask and DTS-mask. The identified redundant regions are highlighted in light blue. To minimize any additional effects caused by masking, the values in the masked regions are set to the average value of the pixels in the current region.
  • ...and 3 more figures
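
The similarity-weighted summation described in the Figure 3 caption can be sketched as follows. This is an illustrative reading under stated assumptions, not the authors' code: the essential/nonessential split is taken as a given boolean mask (the clustering step is not reproduced), each nonessential token is assigned to its most similar essential token by cosine similarity, and the weights are a softmax over similarities with the essential token's self-similarity fixed at 1. The function name and these choices are hypothetical.

```python
import numpy as np


def similarity_weighted_aggregation_sketch(tokens, essential_mask):
    """Fold nonessential tokens into essential ones by weighted summation.

    tokens: array (N, D) of visual tokens.
    essential_mask: boolean array (N,), True for essential tokens.
    Returns an array (E, D) with one aggregated token per essential token.
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    essential_mask = np.asarray(essential_mask, dtype=bool)
    essential = tokens[essential_mask]
    others = tokens[~essential_mask]
    if len(essential) == 0 or len(others) == 0:
        return essential.copy()

    def unit(x):
        # Normalize rows to unit length, guarding against zero vectors.
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        return x / np.maximum(norms, 1e-12)

    # Cosine similarity between every nonessential and essential token.
    sim = unit(others) @ unit(essential).T
    owner = sim.argmax(axis=1)  # most similar essential token (assumption)
    best = sim[np.arange(len(others)), owner]

    out = np.empty_like(essential)
    for e in range(len(essential)):
        members = owner == e
        # Self-similarity of the essential token is 1 by definition.
        logits = np.concatenate(([1.0], best[members]))
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        group = np.vstack((essential[e:e + 1], others[members]))
        out[e] = weights @ group
    return out
```

The output has one row per essential token, so the visual sequence shrinks from N to E tokens while nonessential tokens still contribute to the result instead of being discarded.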