DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures

Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro

TL;DR

DocPolarBERT tackles document understanding by removing absolute 2D positional embeddings and encoding layout in attention through relative polar coordinates. The model is pre-trained on publicly available OCR data and evaluated on NER tasks, achieving competitive results with a corpus substantially smaller than IIT-CDIP. Ablation studies show that absolute 2D embeddings can hurt performance and that quantile-based distance bucketing improves cross-layout generalization; the approach also remains competitive on longer documents. Overall, the work demonstrates an efficient, vision-free alternative for layout-driven document understanding, with detailed analyses of attention patterns and scalability.

Abstract

We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take into account text block positions in a relative polar coordinate system rather than a Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
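
To make the idea of a relative polar position concrete, the following is a minimal sketch rather than the authors' implementation. It assumes bounding boxes are given as (x0, y0, x1, y1) tuples in image coordinates (y grows downward) and that the relative position of one text block with respect to another is the distance and angle between their centers. The function names `box_center` and `relative_polar` are hypothetical.

```python
import math


def box_center(box):
    """Return the center (cx, cy) of a box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def relative_polar(box_i, box_j):
    """Polar position of box_j as seen from box_i (assumed convention).

    Returns (distance, angle): the Euclidean distance between the two
    centers and the angle of the center-to-center vector in [0, 2*pi).
    With image coordinates, pi/2 points down the page.
    """
    cx_i, cy_i = box_center(box_i)
    cx_j, cy_j = box_center(box_j)
    dx, dy = cx_j - cx_i, cy_j - cy_i
    distance = math.hypot(dx, dy)
    # atan2 returns values in (-pi, pi]; wrap them into [0, 2*pi).
    angle = math.atan2(dy, dx) % (2 * math.pi)
    return distance, angle


# A block directly to the right of the query block, on the same line.
print(relative_polar((0, 0, 2, 2), (3, 0, 5, 2)))  # (3.0, 0.0)
```

Because only the offset between two blocks enters the computation, shifting the whole page content leaves the result unchanged, which is the property that lets a relative encoding stand in for absolute 2D positions.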

Paper Structure

This paper contains 33 sections, 7 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Illustration of relative positional attention in different models. Multiple words can share the same bounding box. Attention is skewed with respect to: (a) the horizontal and vertical distances between the top left corners; (b) distance between the four corners, respectively; (c) distance and angle between centers of the two bounding boxes.
  • Figure 2: Illustration of relative positions for the token "name" and its neighbors in polar coordinates.
  • Figure 3: Heatmap of average attention (across all heads and layers) for a NET_PAY_PER_PERIOD token (green box) on a Payslips document. Blue-to-red values indicate low-to-high attention. We compare LayoutLM, $\textsc{LayoutLMv3}^*$, and DocPolarBERT, with unwanted activations highlighted by red bounding boxes.
  • Figure 4: Heatmap of average attention for a POST_TAX_DEDUCTIONS_PER_PERIOD token (green box) on a Payslips document. We compare LayoutLM, $\textsc{LayoutLMv3}^*$, and DocPolarBERT, with unwanted activations highlighted by red bounding boxes. In this example, LayoutLM and LayoutLMv3 misclassify the target amount, while DocPolarBERT correctly identifies it.
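
The TL;DR mentions quantile-based distance bucketing. As an illustration only, the sketch below shows one way continuous center-to-center distances could be mapped to discrete buckets whose boundaries are empirical quantiles, so that each bucket receives roughly the same number of training pairs. The helper names `quantile_edges` and `distance_bucket`, the number of buckets, and the choice to estimate quantiles from a plain list of observed distances are all assumptions, not details taken from the paper.

```python
import bisect


def quantile_edges(distances, num_buckets):
    """Return num_buckets - 1 cut points taken at evenly spaced quantiles.

    `distances` is a sample of observed pairwise distances (assumed to
    come from the training documents).
    """
    if not distances or num_buckets < 1:
        raise ValueError("need at least one distance and one bucket")
    ordered = sorted(distances)
    n = len(ordered)
    edges = []
    for k in range(1, num_buckets):
        # Index of the k-th quantile, clamped to the last element.
        idx = min(n - 1, (k * n) // num_buckets)
        edges.append(ordered[idx])
    return edges


def distance_bucket(distance, edges):
    """Map a distance to a bucket index in [0, len(edges)]."""
    return bisect.bisect_right(edges, distance)


sample = [1, 2, 3, 4, 5, 6, 7, 8]
edges = quantile_edges(sample, num_buckets=4)
print(edges)                                       # [3, 5, 7]
print([distance_bucket(d, edges) for d in sample])  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Compared with fixed-width buckets, quantile boundaries adapt to how distances are actually distributed, which is one plausible reason such a scheme could transfer better across page layouts of different scales.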