A Simple yet Effective Layout Token in Large Language Models for Document Understanding

Zhaoqing Zhu; Chuwei Luo; Zirui Shao; Feiyu Gao; Hangdi Xing; Qi Zheng; Ji Zhang

TL;DR

The paper tackles the efficiency and capacity limitations of layout-as-token approaches in large language models by removing the need for extra position IDs. It introduces LayTokenLLM, which encodes layout information as a single token per text segment and reuses the first text position ID, preserving full text-learning capacity and reducing long-context inference issues. A novel pretraining objective, Next Interleaved Text and Layout Token Prediction (NTLP), jointly trains the model to predict both text and layout tokens and includes a layout head to recover bounding-box coordinates. Empirical results show LayTokenLLM outperforming existing layout-integrated LLMs and many multimodal LLMs on multi-page document tasks while maintaining competitive performance on single-page tasks and offering improved efficiency, highlighting its practicality for scalable document understanding.

Abstract

Recent methods that integrate spatial layouts with text for document understanding in large language models (LLMs) have shown promising results. A common approach represents layout information as text tokens and interleaves them with the text content as input to the LLM. However, this approach has a limitation: the tokens that represent layout information require additional position IDs. Because the maximum number of position IDs is constrained, assigning them to layout information leaves fewer for text content, which limits the model's capacity to learn from text during training. It also introduces a large number of potentially untrained position IDs during long-context inference, which can hinder performance on document understanding tasks. To address these issues, we propose LayTokenLLM, a simple yet effective method for document understanding. LayTokenLLM represents layout information as a single token per text segment and uses a specialized positional encoding scheme. It shares position IDs between text and layout tokens, eliminating the need for additional position IDs. This design maintains the model's capacity to learn from text while mitigating long-context issues during inference. Furthermore, a novel pre-training objective called Next Interleaved Text and Layout Token Prediction (NTLP) is devised to enhance cross-modality learning between text and layout tokens. Extensive experiments show that LayTokenLLM outperforms existing layout-integrated LLMs and multimodal LLMs (MLLMs) of similar scale on multi-page document understanding tasks, as well as on most single-page tasks.
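
The position-ID argument above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the placement of the layout token before its segment, and the choice of four layout tokens per segment in the baseline are all assumptions made for illustration. It only shows how sharing the first text token's position ID keeps the largest position ID equal to what plain text alone would need.

```python
from typing import List, Tuple


def interleaved_position_ids(segment_lengths: List[int],
                             layout_tokens_per_segment: int) -> List[int]:
    """Baseline sketch: layout written as ordinary tokens, so every
    layout token and every text token consumes its own position ID."""
    total = sum(n + layout_tokens_per_segment for n in segment_lengths)
    return list(range(total))


def shared_position_ids(segment_lengths: List[int]) -> List[Tuple[str, int]]:
    """Sketch of the shared scheme: one layout token per segment that
    reuses the position ID of the segment's first text token.
    Placing the layout token before the text is an assumption."""
    tokens: List[Tuple[str, int]] = []
    next_id = 0
    for n in segment_lengths:
        tokens.append(("layout", next_id))  # shares the first text token's ID
        for k in range(n):
            tokens.append(("text", next_id + k))
        next_id += n  # only text tokens advance the position counter
    return tokens


if __name__ == "__main__":
    lengths = [3, 2]  # two hypothetical text segments with 3 and 2 tokens
    baseline = interleaved_position_ids(lengths, layout_tokens_per_segment=4)
    shared = shared_position_ids(lengths)
    print("baseline max position ID:", max(baseline))
    print("shared max position ID:", max(pid for _, pid in shared))
    print(shared)
```

In the shared scheme the largest position ID depends only on the number of text tokens, whereas the baseline grows with every layout token added.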
