HIP: Hierarchical Point Modeling and Pre-training for Visual Information Extraction

Rujiao Long, Pengfei Wang, Zhibo Yang, Cong Yao

TL;DR

HIP, which models entities as HIerarchical Points to better conform to the hierarchical nature of the end-to-end VIE task, outperforms previous state-of-the-art methods on public benchmarks, while qualitative results show its interpretability.

Abstract

End-to-end visual information extraction (VIE) aims at integrating the hierarchical subtasks of VIE, including text spotting, word grouping, and entity labeling, into a unified framework. Dealing with the gaps among the three subtasks plays a pivotal role in designing an effective VIE model. OCR-dependent methods heavily rely on offline OCR engines and inevitably suffer from OCR errors, while OCR-free methods, particularly those employing a black-box model, might produce outputs that lack interpretability or contain hallucinated content. Inspired by CenterNet, DeepSolo, and ESP, we propose HIP, which models entities as HIerarchical Points to better conform to the hierarchical nature of the end-to-end VIE task. Specifically, such hierarchical points can be flexibly encoded and subsequently decoded into desired text transcripts, centers of various regions, and categories of entities. Furthermore, we devise corresponding hierarchical pre-training strategies, categorized as image reconstruction, layout learning, and language enhancement, to reinforce the cross-modality representation of the hierarchical encoders. Quantitative experiments on public benchmarks demonstrate that HIP outperforms previous state-of-the-art methods, while qualitative results show its excellent interpretability.
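To make the idea of hierarchical points more concrete, the following is a minimal data-structure sketch, not the authors' implementation. It assumes a word-level point carries a center and a transcript, and an entity-level point groups words and carries a category. The names `WordPoint`, `EntityPoint`, and `group_words` are hypothetical, and computing the entity center as the mean of its word centers is an illustrative assumption only; in HIP these quantities are decoded from learned representations.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class WordPoint:
    """Hypothetical word-level point: a center plus its decoded transcript."""
    center: Point
    transcript: str


@dataclass
class EntityPoint:
    """Hypothetical entity-level point: grouped words plus an entity category."""
    center: Point
    words: List[WordPoint]
    category: str

    @property
    def text(self) -> str:
        # Join word transcripts in the given order to form the entity text.
        return " ".join(w.transcript for w in self.words)


def group_words(words: List[WordPoint], category: str) -> EntityPoint:
    """Group word points into one entity point.

    Assumption for illustration: the entity center is the arithmetic mean
    of its word centers. The paper's model predicts centers instead.
    """
    if not words:
        raise ValueError("an entity needs at least one word point")
    cx = sum(w.center[0] for w in words) / len(words)
    cy = sum(w.center[1] for w in words) / len(words)
    return EntityPoint(center=(cx, cy), words=list(words), category=category)


# Usage: two spotted words grouped into a single labeled entity.
words = [WordPoint((10.0, 20.0), "Invoice"), WordPoint((30.0, 20.0), "Date")]
entity = group_words(words, category="question")
print(entity.text, entity.center, entity.category)
```

The nesting mirrors the three subtasks named in the abstract: spotting yields word points, grouping assembles them into an entity, and labeling assigns the category.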

Paper Structure

This paper contains 18 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Illustration of the proposed HIP framework and the idea of hierarchical points. VIE is decomposed into three successive tasks: word spotting, word grouping, and entity labeling, each of which is accompanied by two pre-training strategies to learn visual, geometric, and semantic clues.
  • Figure 2: Schematic overview of HIP. The modules of the three hierarchical tasks are color-coded: green for spotting, orange for grouping, and purple for labeling. The top gray dashed box illustrates the main process from word point to entity point, where ETD and Tag branches play the roles of word grouping and entity labeling respectively.
  • Figure 3: The visualization of MIM tasks. The first and second rows are the results of WMIM and CMIM, and the first and second columns are the original images and the reconstructed images respectively. The green box indicates a good case, and the red box signifies a blurred case.
  • Figure 4: Qualitative results on FUNSD. The left and right columns show the results of StrucTexTv2 and HIP, respectively. Green rectangles mark correct results. Red boxes denote word spotting errors, with detection errors drawn as dashed boxes and recognition errors as solid boxes. Yellow boxes denote word grouping errors. Blue boxes denote entity labeling errors, with the misclassified categories noted in red text.