Nuremberg Letterbooks: A Multi-Transcriptional Dataset of Early 15th Century Manuscripts for Document Analysis

Martin Mayr, Julian Krenz, Katharina Neumeier, Anna Bub, Simon Bürcky, Nina Brolich, Klaus Herbers, Mechthild Habermann, Peter Fleischmann, Andreas Maier, Vincent Christlein

Abstract

Most datasets in the field of document analysis use highly standardized labels. These simplify specific tasks but often produce outputs that are not directly applicable to humanities research. The Nuremberg Letterbooks dataset, which comprises historical documents from the early 15th century, addresses this gap by providing multiple types of transcriptions and accompanying metadata. This approach allows for developing methods that are more closely aligned with the needs of the humanities. The dataset includes 4 books containing 1711 labeled pages written by 10 scribes. Three types of transcriptions are provided for handwritten text recognition: basic, diplomatic, and regularized. For the latter two, versions with and without expanded abbreviations are also available. Because writers change within pages, a combination of letter ID and writer ID is provided to support writer identification. In the technical validation, we established baselines for various tasks, demonstrating data consistency and providing benchmarks for future research to build upon.
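The label structure described in the abstract can be pictured with a small sketch. It is only an illustration under stated assumptions: the class name, the field names, and the grouping helper are hypothetical and do not reflect the dataset's actual file format. It shows the five text variants per line (basic, plus diplomatic and regularized each with and without expanded abbreviations) and how a letter ID combined with a writer ID separates the hands when several scribes write on one page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LineRecord:
    """One transcribed text line. All field names are hypothetical."""
    book: int
    page: int
    letter_id: str
    writer_id: str
    basic: str
    diplomatic: str
    diplomatic_expanded: str
    regularized: str
    regularized_expanded: str


def group_by_letter_and_writer(lines):
    """Group lines by (letter_id, writer_id).

    A page may contain several letters and several hands, so the page
    number alone is not a usable key for writer identification.
    """
    groups = defaultdict(list)
    for line in lines:
        groups[(line.letter_id, line.writer_id)].append(line)
    return dict(groups)
```

Keying on the pair rather than on the page keeps every group single-writer, which is the property a writer-identification experiment would presumably rely on.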

Paper Structure

This paper contains 14 sections, 1 equation, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Step 1 covers data acquisition and preprocessing: the pages of the scanned documents are separated and then segmented into lines. In step 2, the transcriptions and metadata are labeled manually. The basic transcriptions serve as the foundation for the regularized and diplomatic text versions. At the same time, metadata such as writer IDs is annotated. In step 3, manual corrections are made, and the resulting data is analyzed for technical validation.
  • Figure 2: Overview of the Handwritten Text Recognition model. Red arrows show the image information flow, blue arrows show the text information flow, and black arrows show the combined information flow. The architecture is a combination of a shallow CNN and a transformer.
  • Figure 3: Visualization of dimensionality-reduced global feature vectors of all books. Each sample point denotes one letter in the letterbooks and is color-coded by its writer label. The box in the middle gives an overview of all samples. The boxes branching out from it are zoomed-in views of writer clusters.