LadderMoE: Ladder-Side Mixture of Experts Adapters for Bronze Inscription Recognition
Rixin Zhou, Peiqiang Qiu, Qian Zhang, Chuntao Li, Xi Yang
TL;DR
This work tackles bronze inscription (BI) recognition by building a large, cross-domain BI dataset and a two-stage detect-then-recognize pipeline. It introduces LadderMoE, a parameter-efficient approach that augments a CLIP encoder with ladder-style mixture-of-experts (MoE) adapters, enabling dynamic expert specialization across heterogeneous BI domains and long-tailed classes. Experiments show state-of-the-art results on both single-character and full-page recognition across color photographs, rubbings, and tracings, with strong performance on head, mid, and tail categories. The approach provides a robust foundation for automatic bronze inscription transcription and supports downstream archaeological analyses such as dating, archaeogeographical studies, and literature retrieval.
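To make the idea of ladder-style MoE adapters concrete, the following is a minimal toy sketch in plain Python. It is not the authors' implementation: the summary above does not specify the gating rule, the number of experts, or how side features are fused, so the top-k softmax router, the bottleneck experts, the additive fusion, and every name used here (`MoEAdapter`, `ladder_side_forward`, `top_k`, and so on) are illustrative assumptions. The backbone features stand in for per-layer outputs of a frozen encoder.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; stands in for trained parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class MoEAdapter:
    """Toy top-k mixture of bottleneck adapters (hypothetical design)."""

    def __init__(self, dim, bottleneck, num_experts, top_k, rng):
        self.top_k = top_k
        self.router = rand_matrix(num_experts, dim, rng)
        self.down = [rand_matrix(bottleneck, dim, rng) for _ in range(num_experts)]
        self.up = [rand_matrix(dim, bottleneck, rng) for _ in range(num_experts)]

    def route(self, x):
        """Return {expert_index: weight} for the top-k experts, renormalized to sum to 1."""
        probs = softmax(matvec(self.router, x))
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[: self.top_k]
        norm = sum(probs[i] for i in top)
        return {i: probs[i] / norm for i in top}

    def __call__(self, x):
        out = [0.0] * len(x)
        for i, weight in self.route(x).items():
            # Each expert is a down-projection, ReLU, then up-projection.
            hidden = [max(0.0, h) for h in matvec(self.down[i], x)]
            y = matvec(self.up[i], hidden)
            out = [o + weight * yi for o, yi in zip(out, y)]
        return out


def ladder_side_forward(backbone_feats, adapters):
    """Carry a side state up the 'ladder', fusing one frozen backbone feature per layer."""
    side = [0.0] * len(backbone_feats[0])
    for feat, adapter in zip(backbone_feats, adapters):
        fused = [f + s for f, s in zip(feat, side)]  # additive fusion (assumption)
        delta = adapter(fused)
        side = [s + d for s, d in zip(side, delta)]  # residual update of the side state
    return side


# Usage: three backbone layers with 8-dimensional features, 4 experts, top-2 routing.
rng = random.Random(0)
dim, num_layers = 8, 3
adapters = [MoEAdapter(dim, bottleneck=4, num_experts=4, top_k=2, rng=rng) for _ in range(num_layers)]
feats = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_layers)]
side_out = ladder_side_forward(feats, adapters)
weights = adapters[0].route(feats[0])
```

In this sketch only the adapter weights would be trained while the backbone features stay fixed, which is what makes the setup parameter-efficient; the router lets different inputs (for example, a rubbing versus a photograph) activate different experts.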
Abstract
Bronze inscriptions (BI), engraved on ritual vessels, constitute a crucial stage of early Chinese writing and provide indispensable evidence for archaeological and historical studies. However, automatic BI recognition remains difficult due to severe visual degradation, multi-domain variability across photographs, rubbings, and tracings, and an extremely long-tailed character distribution. To address these challenges, we curate a large-scale BI dataset comprising 22,454 full-page images and 198,598 annotated characters spanning 6,658 unique categories, enabling robust cross-domain evaluation. Building on this resource, we develop a two-stage detection-recognition pipeline that first localizes inscriptions and then transcribes individual characters. To handle heterogeneous domains and rare classes, we equip the pipeline with LadderMoE, which augments a pretrained CLIP encoder with ladder-style MoE adapters, enabling dynamic expert specialization and stronger robustness. Comprehensive experiments on single-character and full-page recognition tasks demonstrate that our method substantially outperforms state-of-the-art scene text recognition baselines, achieving superior accuracy across head, mid, and tail categories as well as all acquisition modalities. These results establish a strong foundation for bronze inscription recognition and downstream archaeological analysis.
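As an illustration of what reporting accuracy across head, mid, and tail categories involves, the sketch below groups classes by training frequency and computes per-group accuracy. The abstract does not state how the three groups are defined, so the thresholds `head_min` and `tail_max`, the function names, and the choice to treat unseen classes as tail are all assumptions made for the example.

```python
from collections import Counter


def bucket_classes(train_labels, head_min=100, tail_max=10):
    """Assign each class to 'head', 'mid', or 'tail' by its training count.

    The thresholds are hypothetical; the paper's actual split is not given here.
    """
    counts = Counter(train_labels)
    buckets = {}
    for label, n in counts.items():
        if n >= head_min:
            buckets[label] = "head"
        elif n <= tail_max:
            buckets[label] = "tail"
        else:
            buckets[label] = "mid"
    return buckets


def accuracy_by_bucket(buckets, y_true, y_pred):
    """Compute accuracy separately for each frequency bucket."""
    correct, total = Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        # Classes never seen in training are counted as tail (assumption).
        bucket = buckets.get(truth, "tail")
        total[bucket] += 1
        correct[bucket] += int(truth == pred)
    return {bucket: correct[bucket] / total[bucket] for bucket in total}


# Usage with a tiny made-up label set: "A" is frequent, "B" moderate, "C" rare.
train = ["A"] * 5 + ["B"] * 3 + ["C"]
buckets = bucket_classes(train, head_min=5, tail_max=1)
scores = accuracy_by_bucket(buckets, ["A", "A", "B", "C"], ["A", "B", "B", "A"])
```

Reporting the three numbers separately matters for a long-tailed distribution because overall accuracy is dominated by frequent characters and can hide poor performance on rare ones.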
