DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding

Wenhui Liao, Jiapeng Wang, Hongliang Li, Chengyu Wang, Jun Huang, Lianwen Jin

TL;DR

DocLayLLM is an efficient multi-modal extension of LLMs for text-rich document understanding (TDU). It encodes OCR text, 2D layout, and patch-based visual information directly into a single LLM input, eliminating the need for a separate document encoder. It pairs this with CoT Pre-training on QA-formatted step-by-step data spanning layout, table, and geometry tasks, and with a CoT Annealing strategy that progressively favors direct answers during supervised fine-tuning (SFT). The approach achieves state-of-the-art or competitive performance against OCR-dependent and OCR-free baselines with substantially reduced training resources, and it generalizes robustly in zero-shot settings. These contributions show that a lightweight, unified LLM framework can handle diverse TDU tasks while remaining efficient and scalable in real-world settings.

Abstract

Text-rich document understanding (TDU) requires comprehensive analysis of documents containing substantial textual content and complex layouts. While Multimodal Large Language Models (MLLMs) have made rapid progress in this domain, existing approaches either demand significant computational resources or struggle with effective multi-modal integration. In this paper, we introduce DocLayLLM, an efficient multi-modal extension of LLMs specifically designed for TDU. By lightly integrating visual patch tokens and 2D positional tokens into the LLM's input and encoding the document content with the LLM itself, we take full advantage of the LLM's document comprehension capability and enhance its perception of OCR information. We also examine the role of chain-of-thought (CoT) in depth and propose two techniques, CoT Pre-training and CoT Annealing. DocLayLLM achieves strong performance under lightweight training settings, demonstrating its efficiency and effectiveness. Experimental results show that DocLayLLM outperforms existing OCR-dependent methods and OCR-free competitors. Code and model are available at https://github.com/whlscut/DocLayLLM.
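
To make the idea of CoT Annealing more concrete, the sketch below shows one way a training pipeline could progressively favor direct answers over chain-of-thought targets during fine-tuning. It is an illustration only: the linear schedule, the function names `cot_probability` and `build_target`, and the "Answer:" target format are all assumptions made for this example and are not taken from the paper or its released code.

```python
import random


def cot_probability(step: int, total_steps: int) -> float:
    """Hypothetical linear schedule (an assumption, not the paper's).

    Returns the probability of keeping the chain-of-thought in the
    training target: 1.0 at the start of fine-tuning, 0.0 at the end.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - progress


def build_target(reasoning: str, answer: str, step: int,
                 total_steps: int, rng: random.Random) -> str:
    """Pick a CoT target or a direct-answer target for one sample.

    Early steps mostly yield "reasoning + answer"; late steps mostly
    yield the bare answer. The target format is illustrative only.
    """
    if rng.random() < cot_probability(step, total_steps):
        return f"{reasoning}\nAnswer: {answer}"
    return answer


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 500, 1000):
        target = build_target("The total is in the last table row.",
                              "42", step, 1000, rng)
        print(step, repr(target))
```

Under this assumed schedule, every sample at step 0 keeps its reasoning and every sample at the final step is a direct answer; any other monotonically decreasing schedule would fit the same description.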

Paper Structure

This paper contains 22 sections, 3 equations, 5 figures, and 9 tables.

Figures (5)

  • Figure 1: The overall architecture of DocLayLLM.
  • Figure 2: Examples of CoT Pre-training across different tasks. Red text highlights key steps in the chain of thought for each task.
  • Figure 3: Qualitative comparisons of DocLayLLM with other OCR-dependent and OCR-free methods. Zoom in for better view.
  • Figure A: Further qualitative comparisons of DocLayLLM against the SOTA OCR-free method and under various settings.
  • Figure B: Illustration of DocLayLLM's OCR error correction capability.