REE-HDSC: Recognizing Extracted Entities for the Historical Database Suriname Curacao
Erik Tjong Kim Sang
TL;DR
REE-HDSC investigates the automatic recognition and extraction of entities from hand-written text recognition (HTR) output for Curaçao death certificates (1831–1950). The authors implement a six-step pipeline (layout analysis, baseline, HTR, entity recognition, name correction, and entity linking) and evaluate it on historical civil registry data, comparing regular expressions, ChatGPT, and HTR retraining approaches. They find high precision for dates but limited accuracy for person names with the baseline system; retraining HTR on names, targeted post-processing, and margin-name consolidation substantially improve name recognition (up to ~70–75% in retrained models), though entity linking remains challenging. The work demonstrates practical, scalable paths toward semi-automatic data capture from historical documents and outlines concrete steps for future improvements and broader applicability: more training data, layout-order correction, and volunteer involvement.
Abstract
We describe the project REE-HDSC and outline our efforts to improve the quality of named entities extracted automatically from texts generated by hand-written text recognition (HTR) software. We present a six-step processing pipeline and test it by processing 19th- and 20th-century death certificates from the civil registry of Curaçao. We find that the pipeline extracts dates with high precision but that the precision of person name extraction is low. Next, we show how the precision of name extraction can be improved by retraining HTR models with names, by post-processing, and by identifying and removing incorrect names.
