LadderMoE: Ladder-Side Mixture of Experts Adapters for Bronze Inscription Recognition

Rixin Zhou, Peiqiang Qiu, Qian Zhang, Chuntao Li, Xi Yang

TL;DR

This work tackles bronze inscription recognition by building a large, cross-domain BI dataset and a two-stage detect-then-recognize pipeline. It introduces LadderMoE, a parameter-efficient approach that augments a CLIP encoder with ladder-style MoE adapters to enable dynamic expert specialization across heterogeneous BI domains and long-tailed classes. Experiments show state-of-the-art results on both single-character and full-page recognition across color photos, rubbings, and tracings, with strong performance on head, mid, and tail categories and across acquisition modalities. The approach provides a robust foundation for automatic bronze-inscription transcription and supports downstream archaeological analyses such as dating, archaeogeographical studies, and literature retrieval.

Abstract

Bronze inscriptions (BI), engraved on ritual vessels, constitute a crucial stage of early Chinese writing and provide indispensable evidence for archaeological and historical studies. However, automatic BI recognition remains difficult due to severe visual degradation, multi-domain variability across photographs, rubbings, and tracings, and an extremely long-tailed character distribution. To address these challenges, we curate a large-scale BI dataset comprising 22,454 full-page images and 198,598 annotated characters spanning 6,658 unique categories, enabling robust cross-domain evaluation. Building on this resource, we develop a two-stage detection-recognition pipeline that first localizes inscriptions and then transcribes individual characters. To handle heterogeneous domains and rare classes, we equip the pipeline with LadderMoE, which augments a pretrained CLIP encoder with ladder-style MoE adapters, enabling dynamic expert specialization and stronger robustness. Comprehensive experiments on single-character and full-page recognition tasks demonstrate that our method substantially outperforms state-of-the-art scene text recognition baselines, achieving superior accuracy across head, mid, and tail categories as well as all acquisition modalities. These results establish a strong foundation for bronze inscription recognition and downstream archaeological analysis.
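
The abstract describes MoE adapters that enable dynamic expert specialization. As a rough illustration of that general idea, and not the authors' implementation, the pure-Python sketch below applies a sparse top-k MoE adapter to a single token vector. The class name `MoEAdapter`, the bottleneck (down-project, ReLU, up-project) expert design, the residual connection, top-2 routing, and all sizes are assumptions chosen for the example; the abstract does not specify any of them.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class MoEAdapter:
    """Hypothetical sparse MoE adapter: router, bottleneck experts, residual add."""

    def __init__(self, dim, bottleneck, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.router = random_matrix(num_experts, dim, rng)
        # Each expert is a (down-projection, up-projection) pair.
        self.experts = [
            (random_matrix(bottleneck, dim, rng), random_matrix(dim, bottleneck, rng))
            for _ in range(num_experts)
        ]

    def route(self, token):
        """Return (expert_index, weight) pairs for the top-k scoring experts."""
        logits = matvec(self.router, token)
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = ranked[: self.top_k]
        # Renormalize over the selected experts only.
        weights = softmax([logits[i] for i in chosen])
        return list(zip(chosen, weights))

    def forward(self, token):
        """Add the weighted outputs of the selected experts to the input token."""
        output = list(token)
        for index, weight in self.route(token):
            down, up = self.experts[index]
            hidden = [max(0.0, h) for h in matvec(down, token)]  # ReLU
            update = matvec(up, hidden)
            output = [o + weight * u for o, u in zip(output, update)]
        return output


adapter = MoEAdapter(dim=8, bottleneck=2, num_experts=4, top_k=2)
token = [0.5, -1.0, 0.25, 0.0, 1.5, -0.5, 0.75, 1.0]
routing = adapter.route(token)    # two (expert_index, weight) pairs
adapted = adapter.forward(token)  # same dimensionality as the input
```

Only the selected experts are evaluated for a given token, so the per-token cost depends on `top_k` and not on the total number of experts.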

Paper Structure

This paper contains 25 sections, 7 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of our problem setting and approach for full-page bronze inscription recognition. The illustration emphasizes the cross-domain, degraded, and long-tailed nature of the data and motivates a detection–recognition–ordering framework that is robust to these factors. The resulting structured transcriptions enable downstream archaeological analyses, including bronze dating, archaeogeographical study, and literature retrieval.
  • Figure 2: Framework of our bronze inscription recognition model. A Transformer-based encoder is augmented with interleaved MoE-Adapters, and the enriched features are decoded into character codes. Each MoE-Adapter consists of multiple experts and a unified router that dynamically selects a sparse subset of experts for each input.
  • Figure 3: Category frequency distribution of the 1,352 selected bronze inscription categories. The distribution exhibits a pronounced long-tailed pattern: a few characters occur thousands of times (up to over 8,000 instances), while the majority of classes appear only sparsely. This imbalance poses a significant challenge for recognition models and motivates our evaluation across Head, Mid, and Tail subsets.
  • Figure 4: Correct recognition examples across frequency groups and domains. The recognition results highlight our model’s robustness to both distribution shifts and domain variations.
  • Figure 5: Detection results of YOLO-v12 on three types of inscription images: (a) rubbing images, (b) tracing images, and (c) color images. The blue bounding boxes denote the detected inscription regions.
  • ...and 2 more figures
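
The Figure 3 caption motivates evaluating separately on Head, Mid, and Tail subsets. The sketch below shows one way such a grouped accuracy could be computed. The function names and the count thresholds (`head_min=100`, `tail_max=10`) are placeholders for illustration; the cutoffs actually used in the paper are not given in this summary.

```python
from collections import Counter


def frequency_group(count, head_min=100, tail_max=10):
    """Assign a class to Head, Mid, or Tail by its training-set count.

    The default thresholds are illustrative placeholders only.
    """
    if count >= head_min:
        return "Head"
    if count <= tail_max:
        return "Tail"
    return "Mid"


def grouped_accuracy(train_labels, test_labels, predictions, **thresholds):
    """Compute accuracy per frequency group of the ground-truth class."""
    counts = Counter(train_labels)
    correct = Counter()
    total = Counter()
    for truth, pred in zip(test_labels, predictions):
        group = frequency_group(counts[truth], **thresholds)
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Toy data: one frequent, one medium, and one rare class (romanized labels).
train_labels = ["wang"] * 120 + ["ding"] * 30 + ["gui"] * 3
test_labels = ["wang", "wang", "ding", "gui"]
predictions = ["wang", "ding", "ding", "wang"]

accuracy = grouped_accuracy(train_labels, test_labels, predictions)
```

Reporting accuracy per group keeps the many rare classes from being hidden behind a single overall average that is dominated by the few frequent ones.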