DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures
Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro
TL;DR
DocPolarBERT tackles document understanding by removing absolute 2D positional embeddings and using relative polar coordinates to encode layout in attention. The model is pre-trained on publicly available OCR data and evaluated on NER tasks, achieving competitive results with a corpus substantially smaller than IIT-CDIP. Ablation studies show that absolute 2D embeddings can hurt performance, and quantile-based distance bucketing enhances cross-layout generalization, while the approach remains competitive on longer documents. Overall, the work demonstrates an efficient, vision-free alternative for layout-driven document understanding with detailed analyses of attention patterns and scalability.
Abstract
We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take into account text block positions in a relative polar coordinate system rather than a Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
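To make the idea of relative polar coordinates concrete, the following is a minimal sketch of how the pairwise distance and angle between text blocks could be computed and discretized. It is an illustration under stated assumptions, not the authors' implementation: the function names `relative_polar` and `bucketize`, the use of block centers, the bucket counts, and the angle convention are all hypothetical choices, and this section does not say how the resulting buckets enter the attention scores.

```python
import numpy as np


def relative_polar(centers):
    """Pairwise relative polar coordinates between text block centers.

    centers: array-like of shape (n, 2) holding one (x, y) point per block.
    Returns (dist, angle), each of shape (n, n); entry [i, j] describes
    block j as seen from block i. Angles are in [0, 2*pi).
    Using block centers is an assumption made for this sketch.
    """
    centers = np.asarray(centers, dtype=float)
    # delta[i, j] = centers[j] - centers[i]
    delta = centers[None, :, :] - centers[:, None, :]
    dist = np.hypot(delta[..., 0], delta[..., 1])
    angle = np.mod(np.arctan2(delta[..., 1], delta[..., 0]), 2 * np.pi)
    return dist, angle


def bucketize(dist, angle, n_dist_buckets=4, n_angle_buckets=8):
    """Discretize distances by quantiles and angles by equal sectors.

    The bucket counts are illustrative defaults, not values from the paper.
    Quantile edges are estimated from the off-diagonal distances of this
    single document; a real system might estimate them over a corpus.
    """
    n = dist.shape[0]
    off_diagonal = dist[~np.eye(n, dtype=bool)]
    # Interior quantile levels, e.g. 0.25, 0.5, 0.75 for four buckets.
    levels = np.linspace(0.0, 1.0, n_dist_buckets + 1)[1:-1]
    edges = np.quantile(off_diagonal, levels)
    dist_ids = np.searchsorted(edges, dist, side="right")
    sector = 2 * np.pi / n_angle_buckets
    angle_ids = np.floor(angle / sector).astype(int) % n_angle_buckets
    return dist_ids, angle_ids
```

Each pair of blocks then maps to a (distance bucket, angle bucket) pair, which a relative attention mechanism could look up in a learned table. Because both quantities depend only on the offset between two blocks, they stay the same when the whole page is translated, which is the property that lets a model drop absolute 2D positions.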
