Training LayoutLM from Scratch for Efficient Named-Entity Recognition in the Insurance Domain
Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro
TL;DR
This work addresses the limited transferability of generic pre-trained models to domain-specific NLP tasks in finance and insurance, where privacy constraints limit access to in-domain data. It shows that pre-training LayoutLM from scratch on a domain-relevant corpus (DOCILE) yields better NER performance on a novel dataset, Payslips, and that a smaller, faster 6-layer variant achieves comparable results with substantial inference-time gains. The authors release Payslips and show that domain-specific pre-training reduces performance variance and improves results compared with using IIT-CDIP. This offers a practical path to efficient, in-house NER in sensitive domains without relying on large public corpora, and highlights the importance of dataset alignment and model compression for production-ready document understanding in the financial sector.
Abstract
Generic pre-trained neural networks may struggle to produce good results in specialized domains like finance and insurance. This stems from a domain mismatch between training data and downstream tasks, as in-domain data are often scarce because of privacy constraints. In this work, we compare different pre-training strategies for LayoutLM. We show that using domain-relevant documents improves results on a named-entity recognition (NER) problem using a novel dataset of anonymized insurance-related financial documents called Payslips. Moreover, we show that we can achieve competitive results using a smaller and faster model.
