PEneo: Unifying Line Extraction, Line Grouping, and Entity Linking for End-to-end Document Pair Extraction

Zening Lin, Jiapeng Wang, Teng Li, Wenhui Liao, Dayi Huang, Longfei Xiong, Lianwen Jin

TL;DR

This paper introduces PEneo (Pair Extraction new decoder option), a novel framework that performs document pair extraction in a unified pipeline incorporating three concurrent sub-tasks: line extraction, line grouping, and entity linking.

Abstract

Document pair extraction aims to identify key and value entities as well as their relationships from visually-rich documents. Most existing methods divide it into two separate tasks: semantic entity recognition (SER) and relation extraction (RE). However, simply concatenating SER and RE serially can lead to severe error propagation, and it fails to handle cases like multi-line entities in real scenarios. To address these issues, this paper introduces a novel framework, PEneo (Pair Extraction new decoder option), which performs document pair extraction in a unified pipeline, incorporating three concurrent sub-tasks: line extraction, line grouping, and entity linking. This approach alleviates the error accumulation problem and can handle the case of multi-line entities. Furthermore, to better evaluate the model's performance and to facilitate future research on pair extraction, we introduce RFUND, a re-annotated version of the commonly used FUNSD and XFUND datasets, to make them more accurate and cover realistic situations. Experiments on various benchmarks demonstrate PEneo's superiority over previous pipelines, boosting the performance by a large margin (e.g., 19.89%-22.91% F1 score on RFUND-EN) when combined with various backbones like LiLT and LayoutLMv3, showing its effectiveness and generality. Codes and the new annotations are available at https://github.com/ZeningLin/PEneo.
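
The abstract reports results as F1 scores on key-value pairs. As a rough illustration of how such a score could be computed, here is a minimal sketch that assumes a predicted pair counts as correct only when both its key text and its value text exactly match a gold pair. The function name `pair_f1` and the exact-match criterion are assumptions made for this example, not the paper's evaluation script.

```python
def pair_f1(predicted, gold):
    """Return (precision, recall, f1) over sets of (key_text, value_text) pairs.

    Assumption: a prediction is a true positive only on exact match of both
    the key text and the value text with some gold pair.
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical toy data: one of two predictions matches one of three gold pairs.
    predicted = [("Name:", "John"), ("Date:", "2020")]
    gold = [("Name:", "John"), ("Date:", "2021"), ("City:", "Springfield")]
    print(pair_f1(predicted, gold))
```

Under this criterion, a single mistake in either the key or the value discards the whole pair, which is one way to see why errors made in an earlier stage of a serial SER+RE pipeline can lower the final pair-level score.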

Paper Structure

This paper contains 28 sections, 6 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: Examples of SER, RE, and document pair extraction in FUNSD. (a) SER task, which aims at classifying fields into specific given entity categories. (b) RE task, which predicts the relations (green arrows) between the given entities. (c) Document pair extraction task, which requires extracting all key-value pairs from the document image.
  • Figure 2: Examples of the original FUNSD and XFUND annotations. Boxes in blue, green, and grey stand for question, answer, and other entities, respectively. Green arrows refer to key-value links. (a) Annotations for entities with first-line indentation in FUNSD. (b) Inconsistent labeling granularity in XFUND: keys are labeled at the entity level, while values are labeled at the line level. (c) Confusing annotations: the answer entity "Client confirmed agreement ..." was labeled as other, while the other entity "CONFIDENTIAL" was labeled as question.
  • Figure 3: Model architecture of PEneo. Line-level OCR results are processed by the pre-trained multi-modal encoder to get representations of each token. The decoder then generates pair-wise features and applies line extraction, line grouping, and entity linking to obtain predictions of line spans, line aggregation, and key-value relations. Finally, the linking parsing module integrates the predictions above to generate key-value pairs.
  • Figure 4: Performance comparison between PEneo and SER+RE. Left: predictions of SER+RE. Blue, green, and grey boxes indicate predictions for question, answer, and other entities, respectively. Right: predictions of PEneo. Green boxes are correctly extracted lines or entities, and red boxes are false positives. Green arrows are correct pair predictions, and red arrows are incorrect ones.
  • Figure 5: Impact of different SER results on pair extraction performance. FN refers to entity false negative, FP refers to entity false positive, CE refers to entity category error, and EF refers to entity fragmentation.
  • ...and 1 more figure
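
The Figure 3 caption describes a final linking parsing step that combines line spans, line aggregation, and key-value relations into key-value pairs. The sketch below shows one simplified way such a merge could work. It is not the authors' implementation: the function `parse_pairs`, the line-level IDs, and the convention that each entity link connects the first line of a key to the first line of a value are all assumptions made for illustration.

```python
def parse_pairs(lines, group_links, entity_links):
    """Merge three kinds of predictions into (key_text, value_text) pairs.

    lines:        dict mapping a hypothetical line ID to its extracted text
    group_links:  list of (line_id, next_line_id), meaning next_line_id
                  continues the same entity as line_id
    entity_links: list of (key_head_id, value_head_id), assumed to connect
                  the first line of a key to the first line of a value
    """
    next_of = dict(group_links)
    continuation_lines = {nxt for _, nxt in group_links}

    def entity_text(head):
        # Follow grouping links from the head line; guard against cycles.
        parts, current, seen = [], head, set()
        while current is not None and current not in seen:
            seen.add(current)
            parts.append(lines[current])
            current = next_of.get(current)
        return " ".join(parts)

    pairs = []
    for key_head, value_head in entity_links:
        # Skip links that do not start at the first line of an entity.
        if key_head in continuation_lines or value_head in continuation_lines:
            continue
        pairs.append((entity_text(key_head), entity_text(value_head)))
    return pairs


if __name__ == "__main__":
    # Hypothetical toy document with one two-line value.
    lines = {0: "Name:", 1: "John", 2: "Address:", 3: "12 Main St.", 4: "Springfield"}
    group_links = [(3, 4)]
    entity_links = [(0, 1), (2, 3)]
    print(parse_pairs(lines, group_links, entity_links))
```

Because multi-line entities are rebuilt from grouping links rather than from per-entity category labels, a value that spans several lines stays attached to its key as a single string in this toy setting.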