EduceLab-Scrolls: Verifiable Recovery of Text from Herculaneum Papyri using X-ray CT

Stephen Parsons, C. Seth Parker, Christy Chapman, Mami Hayashida, W. Brent Seales

TL;DR

This work tackles the noninvasive recovery of hidden texts in the Herculaneum papyri by introducing EduceLab-Scrolls, a large aligned multimodal dataset that links high-resolution X-ray CT volumes with 2D infrared/spectral labels. It presents a pipeline that combines machine learning with a geometric framework (Volume Cartographer) to generate surface volumes and map 2D labels onto 3D data, enabling ink detection in subsurface scroll layers. The method detects carbon ink signals on fragment surfaces and in subsurface layers; validation against ground-truth images and papyrologist assessments shows that the recovered text is readable, with minimal hallucination. The open dataset and reproducible approach establish a scalable pathway to reading intact Herculaneum scrolls and offer a generalizable framework for other heritage objects that require multimodal alignment for hidden-text recovery.

Abstract

We present a complete software pipeline for revealing the hidden texts of the Herculaneum papyri using X-ray CT images. This enhanced virtual unwrapping pipeline combines machine learning with a novel geometric framework linking 3D and 2D images. We also present EduceLab-Scrolls, a comprehensive open dataset representing two decades of research effort on this problem. EduceLab-Scrolls contains a set of volumetric X-ray CT images of both small fragments and intact, rolled scrolls. The dataset also contains 2D image labels that are used in the supervised training of an ink detection model. Labeling is enabled by aligning spectral photography of scroll fragments with X-ray CT images of the same fragments, thus creating a machine-learnable mapping between image spaces and modalities. This alignment permits supervised learning for the detection of "invisible" carbon ink in X-ray CT, a task that is "impossible" even for human expert labelers. To our knowledge, this is the first aligned dataset of its kind and is the largest dataset ever released in the heritage domain. Our method is capable of revealing accurate lines of text on scroll fragments with known ground truth. Revealed text is verified using visual confirmation, quantitative image metrics, and scholarly review. EduceLab-Scrolls has also enabled the discovery, for the first time, of hidden texts from the Herculaneum papyri, which we present here. We anticipate that the EduceLab-Scrolls dataset will generate more textual discovery as research continues.
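The aligned-label idea described in the abstract can be illustrated with a minimal sketch. It assumes a flattened surface volume stored as a (depth, height, width) array and a binary ink label image of shape (height, width) already aligned to it. The function name `sample_training_pairs`, the patch size, the stride, and the synthetic data are all hypothetical choices for illustration; this is not the authors' ink-ID implementation.

```python
import numpy as np


def sample_training_pairs(surface_volume, ink_labels, half_width=2, stride=4):
    """Pair small subvolumes with the aligned ink label at their center pixel.

    surface_volume: (depth, height, width) array of CT intensities sampled
        about a segmented surface (assumed layout).
    ink_labels: (height, width) binary array aligned to the surface volume.
    """
    depth, height, width = surface_volume.shape
    if ink_labels.shape != (height, width):
        raise ValueError("labels must be aligned to the surface volume")

    subvolumes, targets = [], []
    for row in range(half_width, height - half_width, stride):
        for col in range(half_width, width - half_width, stride):
            # Full depth, small square neighborhood around (row, col).
            patch = surface_volume[
                :,
                row - half_width:row + half_width + 1,
                col - half_width:col + half_width + 1,
            ]
            subvolumes.append(patch)
            targets.append(ink_labels[row, col])
    return np.stack(subvolumes), np.array(targets)


# Synthetic stand-ins for a real surface volume and aligned label image.
rng = np.random.default_rng(0)
surface_volume = rng.random((8, 32, 32)).astype(np.float32)
ink_labels = np.zeros((32, 32), dtype=np.uint8)
ink_labels[10:20, 10:20] = 1  # a square "ink" region

features, targets = sample_training_pairs(surface_volume, ink_labels)
print(features.shape, int(targets.sum()))
```

Each returned subvolume carries the label of its center pixel, so any supervised classifier can be trained on these pairs even though the ink is not visible to a human in the CT data itself.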

Paper Structure (38 sections, 1 equation, 17 figures, 4 tables, 3 algorithms)

Figures (17)

  • Figure 1: Ink detection results for Herculaneum fragments. (a) Ground truth infrared photographs of fragment surfaces. (b) Our method (Volume Cartographer + ink-ID) on fragment surfaces, generated purely from X-ray CT. Cross-validation was used to prevent model memorization. (c) Our method on subsurface hidden layers, revealing text that has not been seen in nearly 2,000 years. (d) Greek transcriptions of (c). ] and [ indicate line beginning and end. A dot indicates indistinct ink traces and an underdot indicates an uncertain transcription.
  • Figure 2: Ground truth and our output for the binary ink classification task on P.Herc.Paris. 1 fr. 34.
  • Figure 3: Greek transcriptions for P.Herc.Paris. 1 fr. 34 from a trained papyrologist, of both the ground truth image and our generated image. ] and [ indicate line beginning and end. A dot indicates indistinct ink traces and an underdot indicates an uncertain transcription.
  • Figure 4: EduceLab-Scrolls dataset geometry for an example fragment. (a) Scroll fragment, RGB photograph. (b) Volumetric X-ray CT image. (c) 3D surface segmentation. (d) Flattened "surface volume" sampled about the segmented surface mesh. (e) Infrared photograph. (f) Infrared photograph aligned to surface volume. (g) Aligned binary ink labels.
  • Figure 5: Visible-wavelength RGB and 1000 nm infrared images of P.Herc.Paris. 2 fr. 47, revealing improved ink contrast in infrared.
  • ...and 12 more figures