TRIDIS: A Comprehensive Medieval and Early Modern Corpus for HTR and NER

Sergio Torres Aguilar

TL;DR

The paper addresses robust evaluation of Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) on medieval and early modern manuscripts, giving researchers a unified, open corpus that helps overcome annotation inconsistencies and in-domain biases. It introduces TRIDIS, a Parquet-based, metadata-rich resource that aggregates multiple open sub-corpora under semi-diplomatic transcription rules, enabling cross-script and cross-language research. An outlier-driven test-split strategy, grounded in joint image-text embeddings, yields a more challenging and realistic assessment of model robustness; baseline experiments (TrOCR and MiniCPM-Llama3-V 2.5) show a sizable performance gap on outlier data compared to random splits. The work aims to spur cross-domain transfer and robust HTR/NER methods for heritage documents, with future plans to broaden script, language, and metadata coverage to further mitigate domain drift.

Abstract

This paper introduces TRIDIS (Tria Digita Scribunt), an open-source corpus of medieval and early modern manuscripts. TRIDIS aggregates multiple legacy collections (all published under open licenses) and incorporates large metadata descriptions. While prior publications referenced some portions of this corpus, here we provide a unified overview with a stronger focus on its constitution. We describe (i) the narrative, chronological, and editorial background of each major sub-corpus, (ii) its semi-diplomatic transcription rules (expansion, normalization, punctuation), (iii) a strategy for challenging out-of-domain test splits driven by outlier detection in a joint embedding space, and (iv) preliminary baseline experiments using TrOCR and MiniCPM2.5 comparing random and outlier-based test partitions. Overall, TRIDIS is designed to stimulate joint robust Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) research across medieval and early modern textual heritage.

Paper Structure

This paper contains 15 sections, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Percentage distribution of languages, chronological periods, and script families in TRIDIS.
  • Figure 2: 3D UMAP distribution. Points in red circles are outliers assembled for the test set. This group, which exhibits a high density of challenging features, typically lies more than 9 units from the centroid.
  • Figure 3: Examples of outlier lines from the TRIDIS test set: 1. defunct maistre Jehan Trucan ne mane chanoine / de l'eglise saint. || 2. vel heredum meorum statuentur, et, quam cito sta- || 3. rochianis de Regniaco hominibus || 4. 2 - 40 || 5. L. de Mongeria || 6. T. de Sancto Petro || 7. otros manteles de mesa Romaniscos || 8. sus ac campipartes unius arpentis terre
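
The Figure 2 caption describes test-set outliers as points lying more than 9 units from the centroid of the embedding space. A minimal sketch of that kind of distance-based split is given below. It is an illustration under stated assumptions, not the paper's algorithm: the function name `split_by_centroid_distance` is hypothetical, the points are synthetic stand-ins for reduced joint image-text embeddings, and plain Euclidean distance to the mean is assumed as the outlier score.

```python
import numpy as np


def split_by_centroid_distance(embeddings, threshold=9.0):
    """Split row indices into (train, test) by distance to the centroid.

    Hypothetical helper: rows farther than `threshold` from the mean point
    are treated as outliers and assigned to the test set.
    """
    points = np.asarray(embeddings, dtype=float)
    centroid = points.mean(axis=0)
    # Euclidean distance of every point to the centroid (assumed metric).
    distances = np.linalg.norm(points - centroid, axis=1)
    is_outlier = distances > threshold
    train_idx = np.flatnonzero(~is_outlier)
    test_idx = np.flatnonzero(is_outlier)
    return train_idx, test_idx, distances


# Synthetic 3D points standing in for reduced embeddings (assumption):
# a dense core of 200 points plus 10 points shifted far along one axis.
rng = np.random.default_rng(0)
core = rng.normal(0.0, 1.0, size=(200, 3))
far = rng.normal(0.0, 1.0, size=(10, 3)) + np.array([15.0, 0.0, 0.0])
points = np.vstack([core, far])

train_idx, test_idx, distances = split_by_centroid_distance(points, threshold=9.0)
print(len(train_idx), len(test_idx))
```

The threshold of 9 is taken from the caption only as a default; in practice it would depend on the scale of the embedding space and on how large a test set is wanted.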