
Combining OCR Models for Reading Early Modern Printed Books

Mathias Seuret, Janne van der Loop, Nikolaus Weichselbaumer, Martin Mayr, Janina Molnar, Tatjana Hass, Florian Kordon, Anguelos Nicolau, Vincent Christlein

TL;DR

The paper presents a system that uses local font group recognition to combine the output of multiple font recognition models, and shows that, while slower, this approach performs better not only on text lines composed of multiple fonts but also on lines containing a single font.

Abstract

In this paper, we investigate the use of fine-grained font recognition for OCR of books printed from the 15th to the 18th century. We used a newly created dataset for OCR of early printed books in which fonts are labeled with bounding boxes. We know not only the font group used for each character, but also the locations of font changes. In books of this period, we frequently find font group changes mid-line or even mid-word that indicate changes in language. We consider 8 different font groups present in our corpus and investigate 13 different subsets: the whole dataset and text lines with a single font, multiple fonts, Roman fonts, Gothic fonts, and each of the considered fonts, respectively. We show that OCR performance is strongly impacted by font style and that selecting fine-tuned models with font group recognition has a very positive impact on the results. Moreover, we developed a system using local font group recognition to combine the output of multiple font recognition models, and show that, while slower, this approach performs better not only on text lines composed of multiple fonts but also on those containing a single font.
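
The two strategies named in the abstract, selecting one fine-tuned model per line and combining several models with local font group scores, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the data layout (one score dictionary per pixel column), the alignment of model output steps with pixel columns, and the weighted-sum combination rule are all hypothetical.

```python
from typing import Dict, List

# Hypothetical layout: one {font_group: score} dict per pixel column.
Scores = List[Dict[str, float]]


def select_font_group(column_scores: Scores) -> str:
    """Line-level selection: return the font group with the highest summed score."""
    totals: Dict[str, float] = {}
    for col in column_scores:
        for group, score in col.items():
            totals[group] = totals.get(group, 0.0) + score
    return max(totals, key=totals.get)


def combine_outputs(column_scores: Scores,
                    model_outputs: Dict[str, List[List[float]]]) -> List[List[float]]:
    """Column-level combination (assumed rule): weight each model's character
    probabilities by the local score of the font group it was tuned for."""
    combined = []
    for t, col in enumerate(column_scores):
        n_chars = len(next(iter(model_outputs.values()))[t])
        mixed = [0.0] * n_chars
        for group, weight in col.items():
            probs = model_outputs[group][t]
            for c in range(n_chars):
                mixed[c] += weight * probs[c]
        combined.append(mixed)
    return combined


# Toy example: two columns, two font groups, two character classes.
scores = [{"antiqua": 0.9, "textura": 0.1},
          {"antiqua": 0.2, "textura": 0.8}]
outputs = {"antiqua": [[1.0, 0.0], [1.0, 0.0]],
           "textura": [[0.0, 1.0], [0.0, 1.0]]}
best_group = select_font_group(scores)
mixed = combine_outputs(scores, outputs)
```

In this toy input the line-level choice favours one group for the whole line, while the column-level mix follows the font change between the two columns. That is the kind of mid-line switch the abstract describes, though the paper's actual combination rule may differ.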

Paper Structure

This paper contains 18 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Pipeline of the SelOCR system.
  • Figure 2: Pipeline of the COCR system.
  • Figure 3: Illustration of the offline augmentations which we applied.
  • Figure 4: Classification results of pixel columns. The two plots correspond to the classification scores for Antiqua and Textura. Colors indicate the pixel column with the highest score. Results for other font groups are not shown, as they were all extremely close to zero.
  • Figure 5: Mean character error rate (CER) for every text line length in our dataset.
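
For readers unfamiliar with the metric in Figure 5, the sketch below computes the character error rate in its usual form: the Levenshtein edit distance between the recognized text and the ground truth, divided by the ground-truth length. This is the common textbook definition. Whether the paper normalizes or averages exactly this way is not stated in this summary, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of a hypothesis against a non-empty reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

For example, `cer("abcd", "abxd")` is 0.25, since one substitution is needed over four reference characters.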