DocGraphLM: Documental Graph Language Model for Information Extraction

Dongsheng Wang, Zhiqiang Ma, Armineh Nourbakhsh, Kang Gu, Sameena Shah

TL;DR

DocGraphLM addresses information extraction and visual question answering on visually rich documents by combining pre-trained language models with graph neural networks, so that both semantic and structural document signals are captured. It introduces a joint encoder and a novel link-prediction objective that predicts the distance and direction between nodes, with a distance-aware loss that emphasizes restoring each node's nearby neighborhood. Empirical results on FUNSD, CORD, and DocVQA show consistent improvements when graph features are added to layout-language models, and the graph components also accelerate training convergence. The work demonstrates the value of combining document structure with multi-modal semantic representations to improve the performance and efficiency of Visually Rich Document Understanding (VrDU).

Abstract

Advances in Visually Rich Document Understanding (VrDU) have enabled information extraction and question answering over documents with complex layouts. Two families of architectures have emerged: transformer-based models inspired by LLMs, and Graph Neural Networks. In this paper, we introduce DocGraphLM, a novel framework that combines pre-trained language models with graph semantics. To achieve this, we propose 1) a joint encoder architecture to represent documents, and 2) a novel link prediction approach to reconstruct document graphs. DocGraphLM predicts both directions and distances between nodes using a convergent joint loss function that prioritizes neighborhood restoration and down-weights distant node detection. Our experiments on three benchmark datasets (FUNSD, CORD, and DocVQA) show consistent improvement on IE and QA tasks with the adoption of graph features. Moreover, we report that adopting the graph features accelerates convergence during training, even though these features are constructed solely through link prediction.
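
To make the idea of a distance-aware joint loss more concrete, the following is a minimal sketch, not the paper's actual formulation. It assumes that distances are normalized to [0, 1], that direction is a classification over a small set of classes, and that each edge is weighted by one minus its true normalized distance so that near neighbors dominate. The function name `joint_link_loss`, the mixing parameter `lam`, and the weighting rule are all hypothetical choices made for illustration.

```python
import math


def joint_link_loss(pred_dist, true_dist, dir_logits, true_dir, lam=0.5):
    """Illustrative distance-weighted joint loss over candidate edges.

    pred_dist, true_dist: per-edge distances, assumed normalized to [0, 1].
    dir_logits: per-edge list of unnormalized direction scores.
    true_dir: per-edge index of the correct direction class.
    lam: assumed mixing weight between the distance and direction terms.
    """
    total = 0.0
    for p, y, logits, d in zip(pred_dist, true_dist, dir_logits, true_dir):
        # Regression term: squared error on the predicted distance.
        reg = (p - y) ** 2

        # Classification term: cross-entropy via a stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(s - m) for s in logits))
        ce = log_z - logits[d]

        # Assumed weighting: closer edges (small y) count more.
        weight = 1.0 - y
        total += weight * (lam * reg + (1.0 - lam) * ce)
    return total / len(pred_dist)


# The same prediction error costs more on a near edge than on a far one.
uniform = [[0.0, 0.0, 0.0, 0.0]]
near = joint_link_loss([0.3], [0.1], uniform, [0])
far = joint_link_loss([0.7], [0.9], uniform, [0])
print(near > far)
```

Under these assumptions, an identical error on a nearby edge contributes more to the loss than on a distant edge, which is one simple way to prioritize neighborhood restoration over distant node detection.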

Paper Structure

This paper contains 15 sections, 4 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: The model architecture of DocGraphLM.
  • Figure 2: Model convergence speed comparison on CORD. The curves are generated from averaging over ten trials.