CNMBERT: A Model for Converting Hanyu Pinyin Abbreviations to Chinese Characters
Zishuo Feng, Feng Cao
TL;DR
This work tackles the conversion of Hanyu Pinyin abbreviations into Chinese characters, a Chinese Spelling Correction (CSC) task made difficult by how little information each abbreviation carries. It introduces CNMBERT, a fill-mask BERT variant that encodes pinyin initials as dedicated mask tokens and incorporates Mixture of Experts (MoE) layers to route different tokens through specialized experts. On a 10,373-sample test set, CNMBERT reaches an MRR of 61.53% and an accuracy of 51.86%, significantly outperforming a fine-tuned Qwen baseline and GPT-4o, with ablation studies confirming the contributions of both the multi-mask strategy and the MoE layers. The approach could benefit downstream tasks such as named entity recognition and sentiment analysis, and points to future work on reducing ambiguity for long or context-poor abbreviations and on extending the method to other languages.
Abstract
Converting Hanyu Pinyin abbreviations to Chinese characters is a significant branch of Chinese Spelling Correction (CSC). It plays an important role in many downstream applications such as named entity recognition and sentiment analysis. The task typically involves text-length alignment and seems easy to solve; however, because pinyin abbreviations carry so little information, accurate conversion is challenging. In this paper, we treat the problem as a fill-mask task and propose CNMBERT, which stands for zh-CN Pinyin Multi-mask BERT Model. By introducing a multi-mask strategy and Mixture of Experts (MoE) layers, CNMBERT outperforms fine-tuned GPT models and ChatGPT-4o, achieving an MRR of 61.53% and an accuracy of 51.86% on a 10,373-sample test dataset.
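The two reported metrics can be illustrated with a short sketch. It assumes the standard definitions: mean reciprocal rank (MRR) averages 1/rank of the correct answer in each ranked candidate list, counting 0 when the answer is absent, and accuracy is the fraction of samples whose top-ranked candidate is correct. The function names and the toy data are hypothetical, and the abstract does not state whether the paper truncates candidate lists at a fixed cutoff.

```python
from typing import List, Sequence, Tuple


def reciprocal_rank(candidates: Sequence[str], target: str) -> float:
    """Return 1/rank of the target in the ranked candidates, or 0.0 if absent."""
    for rank, candidate in enumerate(candidates, start=1):
        if candidate == target:
            return 1.0 / rank
    return 0.0


def evaluate(
    predictions: List[Sequence[str]], targets: List[str]
) -> Tuple[float, float]:
    """Compute (MRR, top-1 accuracy) over paired predictions and targets."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and aligned")
    n = len(targets)
    mrr = sum(reciprocal_rank(p, t) for p, t in zip(predictions, targets)) / n
    accuracy = sum(
        1 for p, t in zip(predictions, targets) if p and p[0] == t
    ) / n
    return mrr, accuracy


if __name__ == "__main__":
    # Toy ranked candidate lists for three abbreviations (illustrative only).
    toy_predictions = [["块钱", "可求"], ["北京", "不久"], ["报告"]]
    toy_targets = ["块钱", "不久", "包裹"]
    print(evaluate(toy_predictions, toy_targets))
```

Because MRR gives partial credit when the correct answer appears below the first position, it is always at least as high as top-1 accuracy, which is consistent with the 61.53% and 51.86% figures above.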
