Deciphering Oracle Bone Language with Diffusion Models
Haisu Guan, Huanxin Yang, Xinyu Wang, Shengwei Han, Yongge Liu, Lianwen Jin, Xiang Bai, Yuliang Liu
TL;DR
Oracle Bone Script (OBS) remains largely undeciphered because corpora are limited and inscriptions are fragmentary. OBSD (Oracle Bone Script Decipher) introduces a conditional diffusion framework with Localized Structural Sampling (LSS) and a zero-shot refinement module to translate OBS images into modern Chinese character forms. It builds on a forward diffusion process $q(X_t|X_{t-1})$ and trains a denoiser $f_\theta$ with the objective $L = \mathbb{E}_{\epsilon,\gamma} \|\epsilon - f_\theta(\tilde{X}, X_t, \gamma)\|^2$, together with an offset loss $\mathcal{L}_{\text{offset}} = \mathrm{mean}(\|\delta_{\text{offset}}\|)$. The paper reports that OBSD outperforms adapted image-translation baselines on the HUST-OBS and EVOBC datasets, with higher Top-N accuracy and qualitatively better reconstructed characters. This work offers a viable AI-assisted decipherment pathway for ancient scripts and lays groundwork for applying diffusion-based methods to other hieroglyphic or pictographic writing systems.
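As a rough illustration of the two losses named in the TL;DR, the sketch below evaluates the denoising objective and the offset loss on toy data. It is not the authors' implementation. It assumes the standard closed-form noising step $X_t = \sqrt{\gamma} X_0 + \sqrt{1-\gamma}\,\epsilon$ (the TL;DR only names $q(X_t|X_{t-1})$), treats $\tilde{X}$ as the conditioning OBS image, reads $\gamma$ as a noise level in (0, 1), and represents $\delta_{\text{offset}}$ as a list of 2-D vectors. The function names `forward_noise`, `denoising_loss`, and `offset_loss` and the toy predictor are hypothetical.

```python
import math
import random


def forward_noise(x0, gamma, rng):
    """Assumed closed form: X_t = sqrt(gamma) * X_0 + sqrt(1 - gamma) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(gamma), math.sqrt(1.0 - gamma)
    xt = [a * x + b * e for x, e in zip(x0, eps)]
    return xt, eps


def denoising_loss(f_theta, x_cond, x0, gamma, rng):
    """Single-sample estimate of ||eps - f_theta(x_cond, X_t, gamma)||^2."""
    xt, eps = forward_noise(x0, gamma, rng)
    eps_hat = f_theta(x_cond, xt, gamma)
    return sum((e - p) ** 2 for e, p in zip(eps, eps_hat))


def offset_loss(offsets):
    """Mean Euclidean norm over a list of (dx, dy) offset vectors (assumed shape)."""
    return sum(math.hypot(dx, dy) for dx, dy in offsets) / len(offsets)


# Toy usage: flattened "images" as short lists of pixel values.
rng = random.Random(0)
obs_image = [0.1, 0.9, 0.4, 0.7]      # stands in for the conditioning image X~
modern_image = [0.0, 1.0, 0.5, 0.5]   # stands in for the target X_0


def zero_predictor(x_cond, xt, gamma):
    """Hypothetical stand-in for f_theta that always predicts zero noise."""
    return [0.0] * len(xt)


print(denoising_loss(zero_predictor, obs_image, modern_image, 0.5, rng))
print(offset_loss([(3.0, 4.0), (0.0, 0.0)]))  # 2.5
```

In training, the expectation over $\epsilon$ and $\gamma$ would be approximated by averaging this single-sample loss over many random draws, with a neural network in place of `zero_predictor`.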
Abstract
Originating from China's Shang Dynasty approximately 3,000 years ago, the Oracle Bone Script (OBS) is a cornerstone in the annals of linguistic history, predating many established writing systems. Despite the discovery of thousands of inscriptions, a vast expanse of OBS remains undeciphered, casting a veil of mystery over this ancient language. The emergence of modern AI technologies presents a novel frontier for OBS decipherment, challenging traditional NLP methods that rely heavily on large textual corpora, a luxury not afforded by historical languages. This paper introduces a novel approach by adopting image generation techniques, specifically through the development of Oracle Bone Script Decipher (OBSD). Utilizing a conditional diffusion-based strategy, OBSD generates vital clues for decipherment, charting a new course for AI-assisted analysis of ancient languages. To validate its efficacy, extensive experiments were conducted on an oracle bone script dataset, with quantitative results demonstrating the effectiveness of OBSD. Code and decipherment results will be made available at https://github.com/guanhaisu/OBSD.
