Diff-Oracle: Deciphering Oracle Bone Scripts with Controllable Diffusion Model

Jing Li, Qiu-Feng Wang, Siyuan Wang, Rui Zhang, Kaizhu Huang, Erik Cambria

TL;DR

Diff-Oracle introduces a diffusion-based framework for controllable oracle bone script generation and recognition. It learns a style encoder that maps style images to CLIP-compatible embeddings and a content encoder trained on pixel-level paired data produced by CUT, enabling precise control over both style and glyph content. A two-stage training strategy disentangles style and content, and PAIR-like multi-modal guidance with independent content/style scales enables diverse, high-fidelity generation. Empirically, Diff-Oracle achieves state-of-the-art generation metrics and large recognition gains, including 84.62% zero-shot accuracy on OBC306, suggesting its potential as a practical tool for decipherment and archaeological analysis.

Abstract

Deciphering oracle bone scripts plays an important role in Chinese archaeology and philology. However, a significant challenge remains due to the scarcity of oracle character images. To overcome this issue, we propose Diff-Oracle, a novel approach based on diffusion models to generate a diverse range of controllable oracle characters. Unlike traditional diffusion models that operate primarily on text prompts, Diff-Oracle incorporates a style encoder that uses style reference images to control the generation style. This encoder extracts style prompts from existing oracle character images, where style details are converted into a text embedding format via a pretrained language-vision model. In addition, a content encoder is integrated within Diff-Oracle to capture specific content details from content reference images, ensuring that the generated characters accurately represent the intended glyphs. To effectively train Diff-Oracle, we pre-generate pixel-level paired oracle character images (i.e., style and content images) with an image-to-image translation model. Extensive qualitative and quantitative experiments are conducted on the Oracle-241 and OBC306 datasets. While significantly surpassing present generative methods in terms of image generation, Diff-Oracle substantially benefits downstream oracle character recognition, outperforming all existing state-of-the-art methods by a large margin. In particular, on the challenging OBC306 dataset, Diff-Oracle leads to an accuracy gain of 7.70% in the zero-shot setting and is able to recognize unseen oracle character images with an accuracy of 84.62%, achieving a new benchmark for deciphering oracle bone scripts.

Paper Structure

This paper contains 32 sections, 6 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Example of an oracle bone (a) and its related scanned rubbings (b).
  • Figure 2: Modern Chinese characters (a) and their corresponding scanned oracle characters (b). Given a scanned oracle character image (b) as reference style and a handprinted image (c) as reference content, Diff-Oracle is able to generate realistic and controllable samples (d). Images in the same row belong to the same class.
  • Figure 3: Data distributions of training and test sets from OBC306 sorted by class cardinality.
  • Figure 4: Overall architecture of Diff-Oracle, comprising four blocks: Autoencoder, Stable Diffusion, Content Learning, and Style Learning. During training, an oracle character image $x$, a pixel-level matched content image $x_c$, and a style image $x_s$ ($x=x_s$ here) are input to the model. Style and content information are then extracted by the style encoder $\tau_s$ and the content encoder $\tau_c$, respectively. Meanwhile, the Encoder in the Autoencoder block extracts features from $x$, which places the diffusion process in the latent space. Finally, based on these extracted features, the Stable Diffusion block is fine-tuned, and the style encoder $\tau_s$ and the content encoder $\tau_c$ are trained. In the generation phase, given a handprinted image as content $x'_c$ and a scanned image as style $x'_s$, Diff-Oracle generates a new oracle character $\tilde{x}$ from random noise $z_T$, with the same content as $x'_c$ and the same style as $x'_s$.
  • Figure 5: Architecture of the style encoder $\tau_s$, which comprises three modules: a CLIP Image Encoder $\tau_{s1}$, Multi-layer Attention $MultiAtt$, and a CLIP Text Encoder $\tau_{s2}$. The style input is first processed by $\tau_{s1}$ to obtain the visual embedding, then by $MultiAtt$ to emphasize the style information, and finally by $\tau_{s2}$ to obtain the style information in text embedding format.
  • ...and 8 more figures
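
The TL;DR mentions guidance with independent content and style scales, and the Figure 4 caption describes generation from noise $z_T$ conditioned on a content image and a style image. As a rough illustration only, the sketch below shows one common way two guidance scales can be combined at a single denoising step. The decomposition (unconditional, content-only, content-plus-style) is an assumption borrowed from standard two-condition classifier-free guidance, not the paper's confirmed equation, and the name `dual_guidance` and the toy arrays are hypothetical stand-ins for real U-Net noise predictions.

```python
import numpy as np


def dual_guidance(eps_uncond, eps_content, eps_full, content_scale, style_scale):
    """Combine three noise predictions using separate content and style scales.

    ASSUMPTION: this follows the usual two-condition classifier-free guidance
    form; the exact formula used by Diff-Oracle is not given in this summary.

    eps_uncond  : prediction with neither content nor style conditioning
    eps_content : prediction with content conditioning only
    eps_full    : prediction with both content and style conditioning
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_content = np.asarray(eps_content, dtype=float)
    eps_full = np.asarray(eps_full, dtype=float)

    # Direction that adding the content condition moves the prediction in.
    content_dir = eps_content - eps_uncond
    # Direction that adding the style condition moves it in, given content.
    style_dir = eps_full - eps_content

    return eps_uncond + content_scale * content_dir + style_scale * style_dir


# Toy stand-ins for the three noise predictions (hypothetical latent shape).
rng = np.random.default_rng(0)
latent_shape = (4, 8, 8)
eps_uncond = rng.standard_normal(latent_shape)
eps_content = rng.standard_normal(latent_shape)
eps_full = rng.standard_normal(latent_shape)

# Emphasise glyph content more strongly than style (scale values are arbitrary).
guided = dual_guidance(eps_uncond, eps_content, eps_full,
                       content_scale=3.0, style_scale=1.5)
```

Under this form, setting both scales to 1 recovers the fully conditioned prediction, and setting both to 0 recovers the unconditional one, so each scale can be tuned on its own to trade glyph fidelity against style strength.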