Improving OCR Quality in 19th Century Historical Documents Using a Combined Machine Learning Based Approach

David Fleischhacker, Wolfgang Goederle, Roman Kern

TL;DR

This work tackles OCR quality on large, structurally complex 19th‑century Schematismus sources by pairing an ML‑driven layout detector with a fine‑tuned OCR engine. A Faster R‑CNN trained on a large synthetic Schematismus‑style dataset segments pages into coherent blocks, which are then OCR’ed by a Tesseract model trained on a custom font that matches the originals. The combined approach substantially lowers character and word error rates, with CER improvements of up to 71.98% and WER improvements of up to 52.49% relative to out‑of‑the‑box baselines, and shows the potential for large‑scale data extraction from historical state manuals. The method also offers a scalable path: synthetic data bootstraps the layout detector, and a limited set of real‑document annotations is used for fine‑tuning, broadening access to rich historical data for social, administrative, and genealogical research.
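The two-stage flow described above (detect layout blocks, then run OCR on each block) can be sketched roughly as follows. This is only an illustration under stated assumptions: `Block`, `crop`, and `ocr_page` are hypothetical names, `detect` stands in for the layout detector and `recognise` for the fine-tuned OCR engine, and the top-to-bottom, left-to-right ordering and the `min_score` threshold are simplifications that the source does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Block:
    """Hypothetical layout block: pixel bounding box plus detector confidence."""
    x0: int
    y0: int
    x1: int
    y1: int
    score: float


# A page is modelled as rows of grayscale pixel values (an assumption for this sketch).
Page = Sequence[Sequence[int]]


def crop(page: Page, block: Block) -> List[List[int]]:
    """Cut the block's rectangle out of the page (x1 and y1 are exclusive)."""
    return [list(row[block.x0:block.x1]) for row in page[block.y0:block.y1]]


def ocr_page(
    page: Page,
    detect: Callable[[Page], List[Block]],
    recognise: Callable[[List[List[int]]], str],
    min_score: float = 0.5,
) -> List[str]:
    """Run layout detection, then OCR each sufficiently confident block.

    Blocks are read top-to-bottom, then left-to-right. This ordering is an
    assumed simplification, not the ordering used in the paper.
    """
    blocks = [b for b in detect(page) if b.score >= min_score]
    blocks.sort(key=lambda b: (b.y0, b.x0))
    return [recognise(crop(page, b)) for b in blocks]
```

In practice `detect` would wrap a trained detector and `recognise` an OCR call; passing them in as callables keeps the sketch independent of any particular library.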

Abstract

This paper addresses a major challenge to historical research on the 19th century. Large quantities of sources have become digitally available for the first time, while extraction techniques are lagging behind. Therefore, we researched machine learning (ML) models to recognise and extract complex data structures in a high-value historical primary source, the Schematismus. It records every single person in the Habsburg civil service above a certain hierarchical level between 1702 and 1918 and documents the genesis of the central administration over two centuries. Its complex and intricate structure, as well as its enormous size, have so far made any more comprehensive analysis of the administrative and social structure of the later Habsburg Empire on the basis of this source impossible. We pursued two central objectives: primarily, the improvement of the OCR quality, for which we considered improved structure recognition to be essential; in the course of the work, it turned out that this also made the extraction of the data structure possible. We chose Faster R-CNN as the base of the ML architecture for structure recognition. In order to obtain the required amount of training data quickly and economically, we synthesised Hof- und Staatsschematismus-style data, which we used to train our model. The model was then fine-tuned with a smaller set of manually annotated historical source data. We then used Tesseract-OCR, which was further optimised for the style of our documents, to complete the combined structure extraction and OCR process. Results show a significant decrease in the two standard measures of OCR performance, WER and CER (where lower values are better). Combined structure detection and fine-tuned OCR improved CER by a remarkable 71.98 percent and WER by 52.49 percent.
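As a reminder of what the two metrics measure, here is a minimal sketch of CER and WER computed as Levenshtein edit distance divided by reference length, at character and word level respectively. The function names are hypothetical, whitespace tokenisation is an assumption, and `relative_improvement` assumes the reported percentages are relative reductions against a baseline error rate; the paper's exact evaluation setup may differ.

```python
from typing import Sequence


def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Minimum number of insertions, deletions and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens (an assumed tokenisation)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage reduction of an error rate relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline
```

Under this reading, an error rate that falls from 0.20 to 0.05 corresponds to a relative improvement of 75 percent.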

Paper Structure (22 sections, 2 equations, 26 figures, 7 tables)

Figures (26)

  • Figure 1: An example of the R-CNN pipeline: starting with the input image, region proposals are generated and passed through a CNN to extract features, which are then used to classify and localize objects [6909475].
  • Figure 2: A visual representation of the Fast R-CNN architecture: starting from the input image, region proposals (RoI projection) are generated using the selective search algorithm and passed through a CNN to extract features, which are then used by a region of interest (RoI) pooling layer to extract a feature vector for each proposal. These feature vectors are then passed through twin layers of a softmax classifier and bounding box regression for the classification and localization of objects in the image [fast-r-cnn].
  • Figure 3: A visual representation of the Faster R-CNN region proposal network (RPN): The input image is passed through a deep neural network to extract feature maps, which are then processed by the RPN to generate region proposals. The RPN uses sliding windows and anchor boxes to predict the probability and class of objects in the image [NIPS2015_14bfa6bb].
  • Figure 4: This flowchart shows a simplified version of the general process of developing a model to detect the layout of Schematismus documents.
  • Figure 5: This flowchart illustrates how each extracted layout element is processed by OCR.
  • ...and 21 more figures