REE-HDSC: Recognizing Extracted Entities for the Historical Database Suriname Curacao

Erik Tjong Kim Sang

TL;DR

REE-HDSC investigates the automatic recognition and extraction of entities from the output of hand-written text recognition (HTR) for Curaçao death certificates (1831–1950). The authors implement a six-step pipeline (layout analysis, baseline detection, HTR, entity recognition, name correction, and entity linking) and evaluate it on historical civil registry data, comparing regular expressions, ChatGPT, and HTR retraining. With the baseline system they find high precision for dates but limited accuracy for person names. Retraining HTR on names, targeted post-processing, and margin-name consolidation substantially improve name recognition (up to ~70–75% in retrained models), though entity linking remains challenging. The work demonstrates practical, scalable paths toward semi-automatic data capture from historical documents and outlines concrete next steps: more training data, layout-order correction, and volunteer involvement.

Abstract

We describe the project REE-HDSC and outline our efforts to improve the quality of named entities extracted automatically from texts generated by hand-written text recognition (HTR) software. We describe a six-step processing pipeline and test it by processing 19th- and 20th-century death certificates from the civil registry of Curaçao. We find that the pipeline extracts dates with high precision but that the precision of person name extraction is low. Next, we show how the precision of name extraction can be improved by retraining HTR models with names, by post-processing, and by identifying and removing incorrect names.
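
To make the precision figures concrete, the sketch below shows one way precision could be computed for extracted entities: the share of extracted values that exactly match a manually annotated value for the same certificate. This is an illustration only, not the project's evaluation code. The function name `precision`, the certificate identifiers, and all dates and names are hypothetical toy data, and exact string matching is an assumed criterion.

```python
def precision(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Share of predicted values that exactly match the gold value
    for the same certificate id (assumed matching criterion)."""
    if not predicted:
        return 0.0
    correct = sum(
        1 for cert_id, value in predicted.items() if gold.get(cert_id) == value
    )
    return correct / len(predicted)


# Hypothetical toy data, not taken from the paper.
gold_dates = {"c1": "1885-01-12", "c2": "1901-07-03", "c3": "1923-11-30"}
pred_dates = {"c1": "1885-01-12", "c2": "1901-07-03"}  # no date found for c3

gold_names = {"c1": "Maria Martina", "c2": "Johannes Pieter", "c3": "Anna Louisa"}
pred_names = {"c1": "Maria Martina", "c2": "Johannes Peter", "c3": "Ama Louisa"}

date_precision = precision(pred_dates, gold_dates)  # both extracted dates correct
name_precision = precision(pred_names, gold_names)  # one of three names correct
```

In this toy example a single misrecognized character makes a name count as incorrect, which is one plausible reason why name precision can lag behind date precision when the input comes from HTR.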

Paper Structure (27 sections, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Flow diagram of automatic death certificate analysis process with six different tasks
  • Figure 2: Three main certificate form formats in the data: three-column (1831-1869, left), early two-column (1869-1934, middle) and late two-column (1935-1950, right). The main text of the forms can be found in the widest center column. However, narrower margin columns may contain hand-written notes which need to be processed as well.
  • Figure 3: Number of death certificates in the collection per year after data cleaning. Left graph: there are two groups of certificates, those from the capital (stad) and those from the other districts (buiten). Right graph: interestingly, the extra Excel annotation file contains data for more certificates than we have scans, with the years 1869 and 1887 as notable exceptions. These years may contain several duplicates in our collection.
  • Figure 4: Text regions identified in two-column texts (1870-1950, left) and in three-column texts (1831-1869, right) by the model P2PaLA_Curacao_bestModel from Hoek (2023). The recognition of the main text column seems to work fine for two-column text but not for three-column text, where the center text column has been combined with the right margin column in 97% of the data.
  • Figure 5: Text regions identified in 100 randomly selected three-column texts (1831-1868) with a two-column layout model (left) and with a three-column layout model (right). The three-column model more often predicts a center column with space for right margin text (75% vs 6%).