Large Language Models Understand Layout

Weiming Li, Manni Duan, Dong An, Yan Shao

TL;DR

This work investigates whether large language models can understand and reason about text layout when the layout is encoded with spatial markers in plain text. The authors introduce TextLayoutQA to quantify layout understanding, trace its origins to pretraining and instruction tuning, and show that including layout information improves performance by 8–33% across several models. They demonstrate that code- and table-rich training data markedly enhance layout understanding, and they propose low-cost, auto-generated data from a text game to strengthen this capability further. They also turn layout-aware processing into practical gains for text-rich VQA through a method called TextLayoutParser, which improves performance on XfundQA, DocVQA, and FeTaQA. Overall, the paper highlights a new axis of capability for LLMs and offers guidance on data design for harnessing layout reasoning in NLP and document-analysis tasks.

Abstract

Large language models (LLMs) demonstrate extraordinary abilities in a wide range of natural language processing (NLP) tasks. In this paper, we show that, beyond text understanding capability, LLMs are capable of processing text layouts that are denoted by spatial markers. They are able to answer questions that require explicit spatial perception and reasoning, while a drastic performance drop is observed when the spatial markers from the original data are excluded. We perform a series of experiments with the GPT-3.5, Baichuan2, Llama2 and ChatGLM3 models on various types of layout-sensitive datasets for further analysis. The experimental results reveal that the layout understanding ability of LLMs is mainly introduced by the coding data used for pretraining and is further enhanced at the instruction-tuning stage. In addition, layout understanding can be enhanced by integrating low-cost, auto-generated data produced by a novel text game. Finally, we show that layout understanding ability is beneficial for building efficient visual question-answering (VQA) systems.
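
To make "text layouts that are denoted by spatial markers" concrete, the sketch below renders positioned text boxes as plain text, using spaces for horizontal offsets and newlines for vertical offsets. It is an illustration only and not the authors' TextLayoutParser: the function name `render_layout`, the `(text, x, y)` box format, and the `char_width` and `line_height` values are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical input format: (text, x, y), where x and y are the left and top
# edges of a text box in pixels, e.g. as produced by an OCR engine.
Box = Tuple[str, float, float]


def render_layout(boxes: List[Box],
                  char_width: float = 10.0,
                  line_height: float = 20.0) -> str:
    """Place each box on a character grid, padding with spaces and newlines.

    char_width and line_height are assumed pixel sizes of one grid cell.
    """
    if not boxes:
        return ""

    # Group boxes by grid row and record the grid column where each starts.
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for text, x, y in boxes:
        row = int(y // line_height)
        col = int(x // char_width)
        rows.setdefault(row, []).append((col, text))

    lines = []
    for row in range(max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and len(line) >= col:
                # The previous text already reaches this column:
                # keep a single space as a separator.
                line += " "
            else:
                # Pad with spaces up to the target column.
                line += " " * (col - len(line))
            line += text
        lines.append(line)  # rows without boxes become blank lines
    return "\n".join(lines)


boxes = [("Name", 0, 0), ("Age", 100, 0), ("Alice", 0, 20), ("30", 100, 20)]
print(render_layout(boxes))
```

With the four boxes above, "Age" and "30" start in the same character column, so the column structure of the original page survives in the plain-text prompt. Joining the same strings with single spaces would discard that alignment, which is the kind of information loss the abstract associates with a drop in performance.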

Paper Structure (37 sections, 9 figures, 14 tables)

Figures (9)

  • Figure 1: Illustration of ChatGPT comprehending text layout.
  • Figure 2: A paired example from the TextLayoutQA dataset with layout (a) and without layout (b); both versions share the same QA set (c).
  • Figure 3: An example of the instruction-table dataset.
  • Figure 4: An example of a word search puzzle.
  • Figure 5: An example of the instruction-generated dataset.
  • ...and 4 more figures