TreeForm: End-to-end Annotation and Evaluation for Form Document Parsing
Ran Zmigrod, Zhiqiang Ma, Armineh Nourbakhsh, Sameena Shah
TL;DR
This work addresses end-to-end form parsing in Visually Rich Form Understanding (VRFU). It identifies limitations in FUNSD-type annotations and proposes TreeForm, a JSON-encoded tree representation that captures hierarchical and tabular form structure. For holistic evaluation, it introduces a novel end-to-end F1 metric and the greedy-aligned tree-edit distance (GAnTED), along with a method to convert FUNSD annotations into TreeForm. Baselines using LayoutXLM and Donut on FUNSD/XFUND demonstrate the viability of TreeForm and reveal tradeoffs between labeling accuracy, edge linking, and structural understanding. By standardizing both annotation and evaluation, TreeForm aims to spur deeper research into annotating, modeling, and evaluating complex form-like documents, and motivates the creation of a dedicated TreeForm dataset.
Abstract
Visually Rich Form Understanding (VRFU) poses a complex research problem because form documents are highly structured yet vary widely in style and content. Current annotation schemes decompose form understanding and omit key hierarchical structure, making development and evaluation of end-to-end models difficult. In this paper, we propose a novel F1 metric to evaluate form parsers and describe a new content-agnostic, tree-based annotation scheme for VRFU: TreeForm. We provide methods to convert previous annotation schemes into TreeForm structures and evaluate TreeForm predictions using a modified version of the normalized tree-edit distance. We present initial baselines for our end-to-end performance metric and the TreeForm edit distance, averaged over the FUNSD and XFUND datasets, of 61.5 and 26.4, respectively. We hope that TreeForm encourages deeper research in annotating, modeling, and evaluating the complexities of form-like documents.
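To make the idea of a tree-based, JSON-serializable form annotation more concrete, the following is a minimal Python sketch. It is an illustration only: the field names, the nesting layout, and the helper functions `flatten` and `leaf_f1` are all hypothetical and are not taken from the TreeForm specification, and the set-based F1 over leaf paths is a simplified stand-in, not the paper's end-to-end F1 metric or its tree-edit distance.

```python
import json

# Hypothetical form tree: keys, nesting, and values are illustrative only
# and do not follow the actual TreeForm schema.
form = {
    "header": "Registration Form",
    "Applicant": {"Name": "Jane Doe", "Date": "01/02/2020"},
    "Items": [  # a table modeled as a list of rows
        {"Description": "Widget", "Qty": "2"},
        {"Description": "Gadget", "Qty": "1"},
    ],
}


def flatten(node, path=()):
    """Yield (path, value) pairs for every leaf of a nested dict/list tree."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from flatten(child, path + (index,))
    else:
        yield (path, node)


def leaf_f1(gold, pred):
    """Simplified F1 over exact (path, value) leaf matches (an assumption,
    not the metric defined in the paper)."""
    gold_set, pred_set = set(flatten(gold)), set(flatten(pred))
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Simulate a prediction that gets one leaf wrong.
prediction = json.loads(json.dumps(form))  # deep copy via a JSON round trip
prediction["Applicant"]["Name"] = "Jane Do"
score = leaf_f1(form, prediction)
```

Because a hierarchy and a table both reduce to nested dictionaries and lists, the whole annotation can be serialized with `json.dumps` and compared structurally, which is the property that a tree-based scheme relies on. An exact-match leaf comparison like this one penalizes a misplaced subtree heavily, which is one reason an edit-distance-style measure can be a useful complement.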
