LAPDoc: Layout-Aware Prompting for Documents

Marcel Lamott, Yves-Noel Weweler, Adrian Ulges, Faisal Shafait, Dirk Krechel, Darko Obradovic

TL;DR

This paper tackles the challenge of document understanding using text-only large language models by injecting document layout through a verbalization-based prompting pipeline. It introduces multiple verbalizers to encode text and geometry, noise models to test robustness, and diverse prompt templates, evaluated on ChatGPT and Solar across standard document benchmarks without fine-tuning. The results show that layout-aware prompting yields notable improvements (up to around 15%) and achieves state-of-the-art performance on InfographicsVQA and WikiTableQuestions, with open-source models approaching commercial performance on several tasks. The findings suggest when to favor text-based LLMs over multi-modal document transformers, and point to practical paths for deploying document understanding systems with minimal additional training, while highlighting limitations in complex layouts and multi-page reasoning.

Abstract

Recent advances in training large language models (LLMs) on massive amounts of purely textual data have led to strong generalization across many domains and tasks, including document-specific tasks. In contrast, there is a trend to train multi-modal transformer architectures tailored for document understanding, designed specifically to fuse textual inputs with the corresponding document layout. This involves a separate fine-tuning step for which additional training data is required. At present, no document transformers with generalization comparable to LLMs are available. This raises the question of which type of model is preferable for document understanding tasks. In this paper we investigate the possibility of using purely text-based LLMs for document-specific tasks by means of layout enrichment. We explore drop-in modifications and rule-based methods to enrich purely textual LLM prompts with layout information. In our experiments we investigate the effects on the commercial ChatGPT model and the open-source LLM Solar. We demonstrate that, with our approach, both LLMs show improved performance on various standard document benchmarks. In addition, we study the impact of noisy OCR and layout errors, as well as the limitations of LLMs when it comes to utilizing document layout. Our results indicate that layout enrichment can improve the performance of purely text-based LLMs for document understanding by up to 15% compared to using plain document text alone. In conclusion, this approach should be considered when choosing between text-based LLMs and multi-modal document transformers.

Paper Structure (20 sections, 7 figures, 3 tables)

This paper contains 20 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of our approach: Document OCR is converted into a text representation using different verbalization strategies (blue). Before verbalization, we optionally degrade the OCR by applying noise to the spatial position of OCR geometries (red). The resulting document text representation is then inserted into a task-specific prompt (yellow) and fed into an LLM (green). Finally, the answers are extracted from the LLM output.
  • Figure 2: Verbalization strategies on a random sample from the SROIE dataset: original (left), SpatialFormat (middle) and PlainText (right).
  • Figure 3: Example for QA prompt B with three questions. Pattern B structure is: DOCUMENT (black), TASK (blue), FORMAT (orange) and OUTPUT (green).
  • Figure 4: Comparison of the noise models for each verbalizer. Values shown are scores averaged over all datasets. PlainText's performance diminishes when the layout is misinterpreted, while the SpatialFormat and SpatialFormatY verbalizers are the least affected by noise introduced into the OCR data. Note that they are not affected by changes to the bounding box ordering, as they operate only on the bounding box coordinates.
  • Figure 5: Relative token overhead introduced by each verbalization strategy compared to the PlainText baseline. Values are given as a percentage of the number of tokens required by the PlainText verbalizer, averaged over all documents in all datasets. SpatialFormat and SpatialFormatY introduce the least token overhead of all verbalizers, and SpatialFormatY requires even fewer tokens than the PlainText baseline.
  • ...and 2 more figures
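
The pipeline described in the captions for Figures 1, 3 and 4 (optional OCR noise, verbalization, prompt assembly) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Box` type, the function names, the grid parameters (`cols`, `row_height`) and the prompt wording are all assumptions made for illustration. `grid_verbalizer` is only loosely inspired by the idea of a whitespace-based spatial format; the DOCUMENT, TASK, FORMAT, OUTPUT ordering follows the Pattern B structure named in Figure 3.

```python
import random
from dataclasses import dataclass


@dataclass
class Box:
    """One OCR word with the top-left corner of its bounding box, in pixels."""
    text: str
    x: int
    y: int


def plain_text_verbalizer(boxes):
    """Join words in the order given by the OCR engine, ignoring geometry."""
    return " ".join(b.text for b in boxes)


def grid_verbalizer(boxes, page_width, cols=40, row_height=20):
    """Approximate the layout with whitespace on a fixed-width character grid.

    Hypothetical stand-in for a spatial verbalizer. It uses only the box
    coordinates, so the order of the input list does not matter. Rows that
    contain no words are skipped.
    """
    rows = {}
    for b in boxes:
        rows.setdefault(b.y // row_height, []).append(b)
    lines = []
    for r in sorted(rows):
        line = ""
        for b in sorted(rows[r], key=lambda b: b.x):
            col = b.x * cols // page_width
            # Pad up to the target column; keep at least one space between words.
            pad = max(col - len(line), 1 if line else 0)
            line += " " * pad + b.text
        lines.append(line)
    return "\n".join(lines)


def jitter_boxes(boxes, sigma, seed=0):
    """Assumed noise model: Gaussian shift of each box position, clamped at 0."""
    rng = random.Random(seed)
    return [
        Box(b.text,
            max(0, b.x + round(rng.gauss(0, sigma))),
            max(0, b.y + round(rng.gauss(0, sigma))))
        for b in boxes
    ]


def shuffle_boxes(boxes, seed=0):
    """Assumed noise model: corrupt the reading order, leave geometry intact."""
    shuffled = list(boxes)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def build_prompt(document_text, questions):
    """Assemble a prompt in DOCUMENT, TASK, FORMAT, OUTPUT order (wording assumed)."""
    task = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, 1))
    return (
        f"DOCUMENT:\n{document_text}\n\n"
        f"TASK:\nAnswer the following questions about the document.\n{task}\n\n"
        "FORMAT:\nReturn one answer per line, prefixed with the question number.\n\n"
        "OUTPUT:\n"
    )
```

In this sketch, shuffling the box order changes the output of `plain_text_verbalizer` but leaves `grid_verbalizer` unchanged, which mirrors the observation in Figure 4 that coordinate-based verbalizers are unaffected by changes to the bounding box ordering.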