SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding

Jiefeng Ma, Yan Wang, Chenyu Liu, Jun Du, Yu Hu, Zhenrong Zhang, Pengfei Hu, Qing Wang, Jianshu Zhang

TL;DR

SRFUND addresses the lack of global hierarchical structure annotations in form understanding by introducing a multilingual, multi-task benchmark that reconstructs form structure from individual words up to the full document. Built on FUNSD and XFUND, SRFUND provides refined word-, text-line-, and entity-level annotations plus item-table localization and global entity relations across eight languages, supporting five tasks that range from word-to-text-line merging to hierarchical structure recovery. Experiments show that uni-modal approaches underperform multi-modal document-pretrained models, with GraphDoc often delivering the strongest performance for global structure recovery, highlighting the value of sentence-level semantics. The dataset supports cross-lingual form understanding research and practical form-processing applications; code and data are available at https://sprateam-ustc.github.io/SRFUND

Abstract

Accurately identifying and organizing textual content is crucial for the automation of document processing in the field of form understanding. Existing datasets, such as FUNSD and XFUND, support entity classification and relationship prediction tasks but are typically limited to local and entity-level annotations. This limitation overlooks the hierarchically structured representation of documents, constraining comprehensive understanding of complex forms. To address this issue, we present SRFUND, a hierarchically structured multi-task form understanding benchmark. SRFUND provides refined annotations on top of the original FUNSD and XFUND datasets, encompassing five tasks: (1) word to text-line merging, (2) text-line to entity merging, (3) entity category classification, (4) item table localization, and (5) entity-based full-document hierarchical structure recovery. We meticulously supplemented the original datasets with missing annotations at various levels of granularity and added detailed annotations for multi-item table regions within the forms. Additionally, we introduce global hierarchical structure dependencies for entity relation prediction tasks, going beyond traditional local key-value associations. The SRFUND dataset covers eight languages (English, Chinese, Japanese, German, French, Spanish, Italian, and Portuguese), making it a powerful tool for cross-lingual form understanding. Extensive experimental results demonstrate that the SRFUND dataset presents new challenges and significant opportunities in handling diverse layouts and global hierarchical structures of forms, thus providing deep insights into the field of form understanding. The original dataset and implementations of baseline methods are available at https://sprateam-ustc.github.io/SRFUND
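
The granularities named in the abstract (words, text lines, entities, and a document-level hierarchy) can be pictured as nested records with parent links. The sketch below is only an illustration under stated assumptions: the class names, field names, and the FUNSD-style category strings are hypothetical and do not reflect the actual SRFUND annotation format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TextLine:
    # Hypothetical record: a text line is an ordered group of words (task 1).
    line_id: int
    words: List[str]


@dataclass
class Entity:
    # Hypothetical record: an entity groups text lines (task 2), carries a
    # category (task 3), and points to its parent entity (task 5).
    entity_id: int
    category: str
    line_ids: List[int]
    parent_id: Optional[int] = None  # None marks a root of the hierarchy


def entity_text(entity: Entity, lines: Dict[int, TextLine]) -> str:
    """Concatenate the words of all text lines in an entity.

    Joining with spaces suits space-delimited languages only.
    """
    return " ".join(w for lid in entity.line_ids for w in lines[lid].words)


def children_of(entities: List[Entity]) -> Dict[Optional[int], List[int]]:
    """Map each parent id (None for roots) to the ids of its child entities."""
    tree: Dict[Optional[int], List[int]] = {}
    for e in entities:
        tree.setdefault(e.parent_id, []).append(e.entity_id)
    return tree


def depth(entity_id: int, by_id: Dict[int, Entity]) -> int:
    """Count the number of parent links between an entity and its root."""
    d = 0
    parent = by_id[entity_id].parent_id
    while parent is not None:
        d += 1
        parent = by_id[parent].parent_id
    return d


# Toy form: a header that owns a question, which owns a two-line answer.
lines = {
    0: TextLine(0, ["Contact", "Information"]),
    1: TextLine(1, ["Name:"]),
    2: TextLine(2, ["Jane"]),
    3: TextLine(3, ["Doe"]),
}
entities = [
    Entity(0, "header", [0]),
    Entity(1, "question", [1], parent_id=0),
    Entity(2, "answer", [2, 3], parent_id=1),
]
by_id = {e.entity_id: e for e in entities}
```

In this toy layout a local key-value link would only connect the question to its answer, whereas the parent pointers also place both under the header, which is the kind of global dependency the abstract describes.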

Paper Structure (18 sections, 3 figures, 7 tables)

Figures (3)

  • Figure 1: Multiple granularity of annotations and supported tasks on SRFUND.
  • Figure 2: Models with varied modalities used for evaluating on the SRFUND benchmark.
  • Figure 3: Visualization of correct (green boxes) and incorrect (red boxes) bounding box predictions for capturing the Header entity (text with yellow background). A predicted bounding box must contain exactly the word-level centers that lie within the ground-truth annotation. Note: in the sub-figure labeled fig:detect_explain_1, only one of the predictions would be considered correct if all three boxes were predicted.
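
The matching rule in the Figure 3 caption can be sketched as a set comparison: a prediction counts as correct when it contains exactly the same word centers as a ground-truth box. The code below is a minimal sketch, not the benchmark's evaluation script. The function names, the `(x_min, y_min, x_max, y_max)` box layout, and the one-to-one matching of predictions to ground-truth boxes are all assumptions made for illustration.

```python
from typing import FrozenSet, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]              # word center (x, y)


def centers_inside(box: Box, centers: Sequence[Point]) -> FrozenSet[int]:
    """Return the indices of the word centers that fall inside the box."""
    x0, y0, x1, y1 = box
    return frozenset(
        i for i, (cx, cy) in enumerate(centers)
        if x0 <= cx <= x1 and y0 <= cy <= y1
    )


def is_correct(pred: Box, gt: Box, centers: Sequence[Point]) -> bool:
    """A prediction is correct if it covers exactly the ground truth's word centers."""
    return centers_inside(pred, centers) == centers_inside(gt, centers)


def count_true_positives(preds: List[Box], gts: List[Box],
                         centers: Sequence[Point]) -> int:
    """Count correct predictions, letting each ground-truth box match at most once.

    The one-to-one matching is an assumption of this sketch.
    """
    gt_sets = [centers_inside(g, centers) for g in gts]
    matched = [False] * len(gts)
    tp = 0
    for p in preds:
        p_set = centers_inside(p, centers)
        for j, g_set in enumerate(gt_sets):
            if not matched[j] and p_set == g_set:
                matched[j] = True
                tp += 1
                break
    return tp


# Three header words on one row, plus one unrelated word below them.
word_centers = [(10.0, 10.0), (30.0, 10.0), (50.0, 10.0), (10.0, 40.0)]
gt_box = (0.0, 0.0, 60.0, 20.0)       # covers the three header words
tight_pred = (5.0, 5.0, 55.0, 15.0)   # same three centers: correct
small_pred = (5.0, 5.0, 35.0, 15.0)   # misses the third word: incorrect
large_pred = (0.0, 0.0, 60.0, 50.0)   # also swallows the fourth word: incorrect
```

Compared with an IoU threshold, this criterion ignores how tightly a box hugs the text and only asks whether it groups the right words, so `tight_pred` passes even though its corners differ from those of `gt_box`.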