Training LayoutLM from Scratch for Efficient Named-Entity Recognition in the Insurance Domain

Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro

TL;DR

This work addresses the limited transferability of generic pre-trained models to domain-specific NLP tasks in finance and insurance, where privacy constraints make in-domain data scarce. It demonstrates that pre-training LayoutLM from scratch on a domain-relevant corpus (DOCILE) yields better NER performance on a novel Payslips dataset, and that a smaller, faster 6-layer variant achieves comparable results with substantial inference-time gains. The authors release Payslips and report that domain-specific pre-training reduces performance variance and improves results compared to pre-training on IIT-CDIP. This provides a practical path to efficient, in-house NER in sensitive domains without relying on large public corpora. The study highlights the importance of dataset alignment and model compression for production-ready document understanding in the financial sector.

Abstract

Generic pre-trained neural networks may struggle to produce good results in specialized domains like finance and insurance. This is due to a domain mismatch between training data and downstream tasks, as in-domain data are often scarce due to privacy constraints. In this work, we compare different pre-training strategies for LayoutLM. We show that using domain-relevant documents improves results on a named-entity recognition (NER) problem using a novel dataset of anonymized insurance-related financial documents called Payslips. Moreover, we show that we can achieve competitive results using a smaller and faster model.

Paper Structure

This paper contains 14 sections, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: Sample of the newly introduced Payslips dataset for named-entity recognition in the insurance domain.
  • Figure 2: Samples from IIT-CDIP (first column), DOCILE (second column) and Payslips (third column) datasets. Invoices from DOCILE and pay statements from Payslips are closer visually and semantically.