TRIDIS: A Comprehensive Medieval and Early Modern Corpus for HTR and NER
Sergio Torres Aguilar
TL;DR
The paper addresses robust evaluation of Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) on medieval and early modern manuscripts, giving researchers a unified, open corpus that reduces annotation inconsistencies and in-domain bias. It introduces TRIDIS, a Parquet-based, metadata-rich resource that aggregates multiple open sub-corpora under shared semi-diplomatic transcription rules, enabling cross-script and cross-language research. A novel outlier-driven test-split strategy, grounded in joint image-text embeddings, yields a more challenging and realistic assessment of model robustness: baseline experiments with TrOCR and MiniCPM-Llama3-V 2.5 show a sizable performance gap between outlier-based and random test splits. The work aims to spur cross-domain transfer and robust HTR/NER methods for heritage documents; future plans include broader script, language, and metadata coverage to further mitigate domain drift.
Abstract
This paper introduces TRIDIS (Tria Digita Scribunt), an open-source corpus of medieval and early modern manuscripts. TRIDIS aggregates multiple legacy collections (all published under open licenses) and incorporates extensive metadata descriptions. While prior publications referenced some portions of this corpus, here we provide a unified overview with a stronger focus on its constitution. We describe (i) the narrative, chronological, and editorial background of each major sub-corpus, (ii) the semi-diplomatic transcription rules applied to the corpus (expansion, normalization, punctuation), (iii) a strategy for building challenging out-of-domain test splits driven by outlier detection in a joint embedding space, and (iv) preliminary baseline experiments using TrOCR and MiniCPM-Llama3-V 2.5 that compare random and outlier-based test partitions. Overall, TRIDIS is designed to stimulate joint, robust Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) research across medieval and early modern textual heritage.
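
To make the idea of an outlier-driven test split concrete, the following minimal Python sketch shows one way such a partition could be built. It is an illustration under stated assumptions, not the procedure used for TRIDIS: the function name `outlier_test_split`, the concatenation of L2-normalized image and text vectors as the "joint" embedding, the distance-to-centroid outlier score, and the 10% test fraction are all hypothetical choices made here for clarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def outlier_test_split(image_emb, text_emb, test_fraction=0.1):
    """Return (train_idx, test_idx), sending the most atypical samples to test.

    Assumption: the joint embedding is the concatenation of the normalized
    image and text vectors, and "outlier" means far from the joint centroid.
    """
    # Normalize each modality so neither dominates the joint vector.
    joint = [l2_normalize(i) + l2_normalize(t) for i, t in zip(image_emb, text_emb)]
    dim = len(joint[0])
    centroid = [sum(row[d] for row in joint) / len(joint) for d in range(dim)]
    # Outlier score: Euclidean distance from the centroid of the joint space.
    scores = [math.dist(row, centroid) for row in joint]
    n_test = max(1, round(test_fraction * len(joint)))
    ranked = sorted(range(len(joint)), key=scores.__getitem__)  # most typical first
    return sorted(ranked[:-n_test]), sorted(ranked[-n_test:])


# Toy data: 18 similar samples plus two that point in very different directions.
toy = [[1.0, 0.01 * (k % 3)] for k in range(18)] + [[0.0, 1.0], [-1.0, 0.0]]
train_idx, test_idx = outlier_test_split(toy, toy, test_fraction=0.1)
```

A random split would instead sample the test indices uniformly, so the test set would resemble the training data. Ranking by an outlier score deliberately places the least typical lines in the test set, which is what makes the resulting evaluation harder.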
