A Simple yet Effective Layout Token in Large Language Models for Document Understanding

Zhaoqing Zhu, Chuwei Luo, Zirui Shao, Feiyu Gao, Hangdi Xing, Qi Zheng, Ji Zhang

TL;DR

The paper addresses the capacity and efficiency limits of layout-as-token approaches in large language models, which spend extra position IDs on layout information. It introduces LayTokenLLM, which encodes the layout of each text segment as a single token and lets that token share the first position ID of the segment's text, so no position IDs are taken away from text content and fewer untrained position IDs appear during long-context inference. A new pretraining objective, Next Interleaved Text and Layout Token Prediction (NTLP), trains the model to predict both text and layout tokens, with a layout head that recovers bounding-box coordinates. In experiments, LayTokenLLM outperforms existing layout-integrated LLMs and multimodal LLMs of similar scale on multi-page document tasks and on most single-page tasks, while offering improved efficiency.

Abstract

Recent methods that integrate spatial layouts with text for document understanding in large language models (LLMs) have shown promising results. A commonly used method is to represent layout information as text tokens and interleave them with text content as inputs to the LLMs. However, such a method still demonstrates limitations, as it requires additional position IDs for tokens that are used to represent layout information. Due to the constraint on max position IDs, assigning them to layout information reduces those available for text content, reducing the capacity for the model to learn from the text during training, while also introducing a large number of potentially untrained position IDs during long-context inference, which can hinder performance on document understanding tasks. To address these issues, we propose LayTokenLLM, a simple yet effective method for document understanding. LayTokenLLM represents layout information as a single token per text segment and uses a specialized positional encoding scheme. It shares position IDs between text and layout tokens, eliminating the need for additional position IDs. This design maintains the model's capacity to learn from text while mitigating long-context issues during inference. Furthermore, a novel pre-training objective called Next Interleaved Text and Layout Token Prediction (NTLP) is devised to enhance cross-modality learning between text and layout tokens. Extensive experiments show that LayTokenLLM outperforms existing layout-integrated LLMs and MLLMs of similar scales on multi-page document understanding tasks, as well as most single-page tasks.
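
The position-ID sharing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the token labels, the choice to place the layout token before its text segment, and the figure of 9 layout-as-text tokens per bounding box are all assumptions made for illustration.

```python
from typing import List, Tuple


def assign_shared_position_ids(segment_lengths: List[int]) -> List[Tuple[str, int]]:
    """Interleave one layout token per text segment and assign position IDs.

    Assumption: the layout token is placed directly before its text segment.
    The layout token reuses the position ID of the segment's first text token,
    so layout tokens consume no position IDs of their own.
    """
    sequence: List[Tuple[str, int]] = []
    next_pos = 0  # next unused position ID for text tokens
    for seg_idx, n_text in enumerate(segment_lengths):
        if n_text <= 0:
            raise ValueError("each text segment needs at least one token")
        # Layout token shares the first text position ID of this segment.
        sequence.append((f"layout_{seg_idx}", next_pos))
        for tok_idx in range(n_text):
            sequence.append((f"text_{seg_idx}_{tok_idx}", next_pos + tok_idx))
        next_pos += n_text
    return sequence


def positions_used_shared(segment_lengths: List[int]) -> int:
    """Position IDs consumed when layout tokens share text position IDs."""
    return sum(segment_lengths)


def positions_used_layout_as_text(segment_lengths: List[int],
                                  layout_tokens_per_segment: int) -> int:
    """Position IDs consumed when each bounding box is written out as text tokens.

    layout_tokens_per_segment is a hypothetical count; the real number depends
    on the tokenizer and on how the coordinates are formatted.
    """
    return sum(n + layout_tokens_per_segment for n in segment_lengths)


if __name__ == "__main__":
    lengths = [3, 2]  # two text segments with 3 and 2 text tokens
    for token, pos in assign_shared_position_ids(lengths):
        print(f"{token:12s} position_id={pos}")
    print("shared scheme:", positions_used_shared(lengths))
    print("layout as text:", positions_used_layout_as_text(lengths, 9))
```

Under these assumptions, two segments of 3 and 2 text tokens occupy position IDs 0 through 4 in the shared scheme, whereas writing each box as 9 text tokens would occupy 23 position IDs for the same content.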

Paper Structure

This paper contains 21 sections, 5 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Comparison with other Layout-as-Token methods. Previous Layout-as-Token methods require additional position IDs for layout information which squeeze the learning space for text content, while LayTokenLLM eliminates the need for additional position IDs of layout information by sharing the first position ID of corresponding text content.
  • Figure 2: The overall architecture of LayTokenLLM. Given text segments with layouts parsed from a document (e.g., by OCR), LayTokenLLM first tokenizes the layout information (bounding box) of each text segment into a single layout token, using a trainable projector and an attention module with a learnable query. The text tokens and layout tokens are then interleaved, and position IDs are assigned so that each layout token shares the first position ID of its text segment, preserving the entire learning space for textual content. Finally, distinct training objectives are applied to the text and to the layout information.
  • Figure 3: Illustration of the Next Interleaved Text and Layout Token Prediction objective. The supervision is conducted on both text and layout tokens to reconstruct text content and layout information simultaneously.
  • Figure 4: Qualitative results on (a) single-page and (b) multi-page document QA, where "Qwen1.5-7B (Text+Layout)" is trained with the same data and LLM as LayTokenLLM-7B but uses the normal text-and-layout format ("text, [123, 456, 133, 500]") instead of the Layout Token. Yellow highlights mark the areas or keys relevant to the question, and green highlights mark the correct answers. (c) Distribution of ANLS over the pages on which questions are posed, on MP-DocVQA. (d) Comparison of layout-related performance on the single-page document dataset DocVQA.
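
Figure 4(c) reports ANLS (Average Normalized Levenshtein Similarity), the usual metric for DocVQA-style benchmarks. The sketch below shows the standard definition with the common threshold of 0.5. It is not the evaluation script used in the paper; the function names and the lowercasing and whitespace stripping are assumptions that mirror common practice.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions: List[str], gold_answers: List[List[str]],
         threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a set of questions.

    For each question, the score is the best similarity against any accepted
    answer; similarities whose normalized distance reaches the threshold
    count as 0.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        p = pred.strip().lower()  # assumption: case-insensitive comparison
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    preds = ["Invoice", "kitten", "abc"]
    golds = [["invoice"], ["sitten", "mitten"], ["xyz"]]
    print(f"ANLS = {anls(preds, golds):.4f}")
```

The threshold keeps near-miss answers, such as small OCR errors, partially credited while giving no credit to answers that are mostly wrong.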