DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
Jiaxin Zhang, Wentao Yang, Songxuan Lai, Zecheng Xie, Lianwen Jin
TL;DR
DocKylin tackles visual document understanding on high-resolution inputs by slimming visual content at both the pixel and token levels. Adaptive Pixel Slimming (APS) removes redundant pixels based on gradient information, while Dynamic Token Slimming (DTS) clusters visual tokens and aggregates the nonessential ones. The model pairs a lightweight Swin-based encoder with a Qwen-7B-Chat LLM and accepts flexible input resolutions of up to 1728×1728. Experiments show that DocKylin achieves state-of-the-art results on multiple VDU benchmarks and that APS and DTS shorten the visual sequence and reduce training time, with similarity-weighted aggregation in DTS providing additional gains. The approach is efficient and modular, and it can be readily integrated into existing MLLMs to improve real-world document analysis.
Abstract
Current multimodal large language models (MLLMs) face significant challenges in visual document understanding (VDU) tasks due to the high resolution, dense text, and complex layouts typical of document images. These characteristics demand a high level of detail perception ability from MLLMs. While increasing input resolution improves detail perception capability, it also leads to longer sequences of visual tokens, increasing computational costs and straining the models' ability to handle long contexts. To address these challenges, we introduce DocKylin, a document-centric MLLM that performs visual content slimming at both the pixel and token levels, thereby reducing token sequence length in VDU scenarios. We introduce an Adaptive Pixel Slimming (APS) preprocessing module to perform pixel-level slimming, increasing the proportion of informative pixels. Moreover, we propose a novel Dynamic Token Slimming (DTS) module to conduct token-level slimming, filtering essential tokens and removing others to adaptively create a more compact visual sequence. Experiments demonstrate DocKylin's promising performance across various VDU benchmarks and the effectiveness of each component.
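To make the two slimming levels concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that pixel-level slimming can be approximated by dropping rows and columns whose gradient magnitude never exceeds a threshold, and that token-level slimming can be approximated by merging each nonessential token into its most similar essential token with a similarity-weighted average. The function names, the `threshold` value, and the `essential` mask (which in DocKylin would come from clustering) are hypothetical and chosen only for illustration.

```python
import numpy as np


def adaptive_pixel_slimming(gray, threshold=10.0):
    """Drop rows and columns of a grayscale image that contain no strong gradient.

    Assumption: a row or column is redundant when its maximum gradient
    magnitude is at or below `threshold` (a hypothetical cutoff).
    """
    img = gray.astype(np.float32)
    gy, gx = np.gradient(img)          # derivatives along rows, then columns
    mag = np.hypot(gx, gy)             # per-pixel gradient magnitude
    keep_rows = mag.max(axis=1) > threshold
    keep_cols = mag.max(axis=0) > threshold
    if not keep_rows.any() or not keep_cols.any():
        return gray                    # blank page: nothing sensible to crop
    return gray[keep_rows][:, keep_cols]


def aggregate_tokens(tokens, essential):
    """Merge nonessential tokens into their most similar essential token.

    tokens:    (N, D) array of visual token features.
    essential: (N,) boolean mask, True for tokens to keep. Assumption: the
               mask is supplied by an earlier clustering step not shown here.
    Returns an (N_essential, D) array.
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    essential = np.asarray(essential, dtype=bool)
    keep, drop = tokens[essential], tokens[~essential]
    if len(keep) == 0 or len(drop) == 0:
        return keep
    # Cosine similarity between each dropped token and each kept token.
    norms = np.maximum(np.linalg.norm(tokens, axis=1, keepdims=True), 1e-8)
    unit = tokens / norms
    sim = unit[~essential] @ unit[essential].T        # (n_drop, n_keep)
    nearest = sim.argmax(axis=1)
    # Negative similarities contribute nothing to the merge.
    weight = np.clip(sim[np.arange(len(drop)), nearest], 0.0, None)
    merged = keep.copy()
    total = np.ones(len(keep))                        # each kept token has weight 1
    np.add.at(merged, nearest, weight[:, None] * drop)
    np.add.at(total, nearest, weight)
    return merged / total[:, None]
```

On a mostly blank page containing a single dark block, `adaptive_pixel_slimming` returns a crop tightly around the block, and `aggregate_tokens` returns one row per essential token, so the sequence passed to the LLM is shorter than the original token sequence.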
