BIRD: Bronze Inscription Restoration and Dating

Wenjie Hua, Hoang H. Nguyen, Gangyan Ge

TL;DR

This work tackles the challenge of dating and restoring fragmentary Bronze Age Chinese inscriptions by creating BIRD, the first fully encoded NLP-ready corpus with chronological labels and an accompanying Glyph Net for allograph guidance. It combines domain-adaptive and task-adaptive pretraining with an allograph-aware training objective and glyph-biased sampling to address ultra-short text and allography sparsity. Empirical results show that Glyph Net stabilizes restoration while glyph-biased sampling enhances chronological dating, with SikuRoBERTa generally delivering the strongest performance. The dataset and framework enable scalable, NLP-assisted paleography, offering a foundation for integrating textual and archaeological signals in future research.

Abstract

Bronze inscriptions from early China are fragmentary and difficult to date. We introduce BIRD (Bronze Inscription Restoration and Dating), a fully encoded dataset grounded in standard scholarly transcriptions and chronological labels. We further propose an allograph-aware masked language modeling framework that integrates domain- and task-adaptive pretraining with a Glyph Net (GN), which links graphemes and allographs. Experiments show that GN improves restoration, while glyph-biased sampling yields gains in dating.

Paper Structure

This paper contains 29 sections, 4 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Left: A simplified paleographer’s workflow for restoring a damaged bronze inscription: identifying the damaged fragment, inferring from parallel expressions, and proposing a restoration (hubei1wuzhenfengxiemingwen). Right: A damaged bronze inscription fragment (CCYZBI.02838A; cass) with the expert’s inferred reading (huanghai). The workflow mirrors a masked language modeling setup, where restorations are hypothesized from local context and attested parallel expressions.
  • Figure 2: Concrete glyph family of Qi ('to pray') from the Shang to the Eastern Zhou. To illustrate the correlation between glyphs and their components, Ideographic Description Sequences (IDS) are used.
  • Figure 3: Our pipeline enhances masked language modeling for bronze inscriptions by combining domain-adaptive pretraining (DAPT), task-adaptive pretraining (TAPT), and a Glyph Net module that integrates allograph glyph information into a BERT or RoBERTa backbone. As illustrated in the lower-right component, each grapheme $G_{1..n}$ is linked to its allographs $A_{1..n}$.
  • Figure 4: Left: Rubbing of the Hu Ding inscriptions (CCYZBI.02838A, 02838B; cass), image courtesy of AS DABII. Right: Transcription from huanghai, used as the input with damaged positions masked.
  • Figure 5: Examples of undeciphered glyphs represented by UNK placeholders in BIRD.
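
The captions for Figures 3 and 4 describe two ideas that a toy example may make more concrete: masking damaged positions in a transcription, and linking each grapheme to its allographs. The sketch below is illustrative only and is not the authors' implementation. The glyph labels (`G1`, `A1a`, ...), the `GLYPH_NET` dictionary, and the helper names are hypothetical placeholders, not BIRD data or code.

```python
# Toy sketch of two ideas from the figure captions (assumptions throughout):
# 1) damaged positions in a transcription are replaced by a mask token;
# 2) a grapheme -> allographs table lets variants be treated as one family.

# Hypothetical glyph table: placeholder labels, not real BIRD entries.
GLYPH_NET = {
    "G1": ["A1a", "A1b", "A1c"],
    "G2": ["A2a"],
}


def build_allograph_index(glyph_net):
    """Invert grapheme -> allographs into glyph -> grapheme family."""
    index = {}
    for grapheme, allographs in glyph_net.items():
        index[grapheme] = grapheme
        for allograph in allographs:
            index[allograph] = grapheme
    return index


def same_family(pred, gold, index):
    """Return True if both glyphs map to the same grapheme family.

    Glyphs missing from the index are treated as their own family.
    """
    return index.get(pred, pred) == index.get(gold, gold)


def mask_positions(tokens, damaged, mask_token="[MASK]"):
    """Replace tokens at damaged positions with the mask token."""
    return [mask_token if i in damaged else t for i, t in enumerate(tokens)]


if __name__ == "__main__":
    index = build_allograph_index(GLYPH_NET)
    tokens = ["A1a", "X", "A2a"]
    print(mask_positions(tokens, {1}))
    print(same_family("A1a", "A1b", index))
    print(same_family("A1a", "A2a", index))
```

Under these assumptions, a restoration candidate that differs from the reference reading only by allograph counts as belonging to the same family. This is one simple way an evaluation or training objective could be made allograph-aware.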