BIRD: Bronze Inscription Restoration and Dating
Wenjie Hua, Hoang H. Nguyen, Gangyan Ge
TL;DR
This work tackles the challenge of dating and restoring fragmentary Bronze Age Chinese inscriptions by creating BIRD, the first fully encoded, NLP-ready corpus with chronological labels, together with an accompanying Glyph Net for allograph guidance. It combines domain-adaptive and task-adaptive pretraining with an allograph-aware training objective and glyph-biased sampling to address ultra-short texts and allography sparsity. Empirical results show that Glyph Net stabilizes restoration while glyph-biased sampling improves chronological dating, with SikuRoBERTa generally delivering the strongest performance. The dataset and framework enable scalable, NLP-assisted paleography and offer a foundation for integrating textual and archaeological signals in future research.
Abstract
Bronze inscriptions from early China are fragmentary and difficult to date. We introduce BIRD (Bronze Inscription Restoration and Dating), a fully encoded dataset grounded in standard scholarly transcriptions and chronological labels. We further propose an allograph-aware masked language modeling framework that integrates domain- and task-adaptive pretraining with a Glyph Net (GN), which links graphemes and allographs. Experiments show that GN improves restoration, while glyph-biased sampling yields gains in dating.
