UNER: A Unified Prediction Head for Named Entity Recognition in Visually-rich Documents

Yi Tu; Chong Zhang; Ya Guo; Huan Chen; Jinyang Tang; Huijia Zhu; Qi Zhang

TL;DR

Named entity recognition in visually-rich documents (VrD-NER) is challenged by complex layouts, reading-order variability, and rigid sequence-labeling paradigms. The authors introduce UNER, a unified prediction head that couples a query-aware token classifier (QTC) with a token order predictor (TOP) atop existing multi-modal document transformers to jointly extract entities and infer reading order. They further show that supervised pre-training on diverse VrD-NER datasets injects universal layout and entity knowledge, which improves cross-domain transfer and supports few-shot and zero-shot extraction. Empirical results across seven benchmarks show gains over prior heads and effective cross-lingual transfer when UNER is combined with supervised pre-training, making VrD-NER more robust and data-efficient on real-world documents.

Abstract

The recognition of named entities in visually-rich documents (VrD-NER) plays a critical role in various real-world scenarios and applications. However, the research in VrD-NER faces three major challenges: complex document layouts, incorrect reading orders, and unsuitable task formulations. To address these challenges, we propose a query-aware entity extraction head, namely UNER, to collaborate with existing multi-modal document transformers to develop more robust VrD-NER models. The UNER head considers the VrD-NER task as a combination of sequence labeling and reading order prediction, effectively addressing the issues of discontinuous entities in documents. Experimental evaluations on diverse datasets demonstrate the effectiveness of UNER in improving entity extraction performance. Moreover, the UNER head enables a supervised pre-training stage on various VrD-NER datasets to enhance the document transformer backbones and exhibits substantial knowledge transfer from the pre-training stage to the fine-tuning stage. By incorporating universal layout understanding, a pre-trained UNER-based model demonstrates significant advantages in few-shot and cross-linguistic scenarios and exhibits zero-shot entity extraction abilities.

Paper Structure (12 sections, 6 equations, 4 figures, 5 tables)

Introduction
Related Work
Methodology
Query-aware Token Classification
Token Order Prediction
Optimization and Inference
Experiments
Datasets
Experimental Settings
Effectiveness of UNER
Effectiveness of Supervised Pre-training
Conclusion

Figures (4)

Figure 1: Illustration of the common issues in the VrD-NER problem. We use color-coded words to indicate entities; words of the same color form a complete entity span. These issues make the document harder to understand and result in discontinuous entities when the document is arranged in a common reading order (e.g., top-to-bottom and left-to-right).
Figure 2: An overview of the entity extraction pipeline for a document transformer using a UNER head. For better illustration, we use a document with a reading order issue as input. Given the input document, the UNER-based model receives the entity names ("header", "question", and "answer") as queries and conducts token classification and order prediction in its two submodules, QTC and TOP, respectively. Here we use numbers to denote the ground-truth labels ("1" for positive and blanks for negative) and colored backgrounds to denote the binary classification predictions (gray for positive and white for negative). Ultimately, we combine the predictions for decoding and obtain the full entity spans. Incorrect predictions are denoted by the red color or a cross.
Figure 3: The few-shot performance on the CORD-r dataset when using different percentages of training samples (from 5% to 100%). We use LayoutMask as the backbone and compare its performance with different prediction heads or supervised pre-training conditions. (1): The entity-level F1 scores in VrD-NER when using TPP, UNER, and UNER with supervised pre-training ("UNER+SP"). (2)&(3): For the UNER-based method, we also report the performance of its submodules: the token-level classification accuracy in QTC and the token order classification accuracy in TOP. As we do not have complete reading order annotations for all the tokens in the documents, in TOP we only calculate the accuracy for entity-related tokens.
Figure 4: Visualization of the entity predictions. We pre-train "LayoutMask+UNER" with SVRD and DocILE and display the predicted entities and their confidence scores with various queries. The grey texts in the brackets serve as translations and are not used as input queries. Incorrect predictions are highlighted in red. (1): A receipt from SROIE with queries that differ from the entity types in the original dataset. (2): Extraction of overlapped entities with vertical alignment in SIBR. (3): A bilingual air ticket with a misaligned layout.
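
The decoding step described in the Figure 2 caption, where the per-query token classifications are combined with the predicted token order to recover full entity spans, might look roughly like the sketch below. This is an illustration under assumptions, not the authors' implementation: it assumes QTC yields one binary flag per token for each query, and that TOP yields for each token the index of its predicted successor (or None). The function name `decode_entities` and the input layout are hypothetical.

```python
from typing import Dict, List, Optional


def decode_entities(
    tokens: List[str],
    is_entity: Dict[str, List[bool]],
    successor: List[Optional[int]],
) -> Dict[str, List[str]]:
    """Combine token classification and token order into entity spans.

    tokens:    words in the (possibly wrong) input reading order.
    is_entity: assumed QTC output, one binary flag per token for each query.
    successor: assumed TOP output, index of the next token or None.
    """
    entities: Dict[str, List[str]] = {}
    for query, flags in is_entity.items():
        positive = {i for i, flag in enumerate(flags) if flag}
        # Positive tokens that another positive token points to.
        pointed_to = {successor[i] for i in positive if successor[i] in positive}
        # A span starts at a positive token with no positive predecessor.
        starts = sorted(positive - pointed_to)
        spans: List[str] = []
        for start in starts:
            chain: List[int] = []
            seen = set()
            cur: Optional[int] = start
            # Follow successor links while staying inside the positive set;
            # the `seen` guard stops on a malformed cyclic prediction.
            while cur is not None and cur in positive and cur not in seen:
                chain.append(cur)
                seen.add(cur)
                cur = successor[cur]
            spans.append(" ".join(tokens[i] for i in chain))
        entities[query] = spans
    return entities


# Toy two-column form read row by row, so "Full name:" is split by "Date:".
tokens = ["Full", "Date:", "name:", "2024-01-05", "Jane", "Doe"]
is_entity = {
    "question": [True, True, True, False, False, False],
    "answer": [False, False, False, True, True, True],
}
successor = [2, None, None, None, 5, None]
print(decode_entities(tokens, is_entity, successor))
```

In this toy input the tokens of "Full name:" are not adjacent in the input order, yet following the successor links recovers the span, which is the kind of discontinuous entity that plain sequence labeling over the input order cannot express.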
