CNMBERT: A Model for Converting Hanyu Pinyin Abbreviations to Chinese Characters

Zishuo Feng, Feng Cao

TL;DR

This work addresses the conversion of Hanyu Pinyin abbreviations into Chinese characters, a Chinese Spelling Correction (CSC) task made difficult by how little information each abbreviation carries. It introduces CNMBERT, a fill-mask BERT variant that encodes pinyin initials as dedicated mask tokens and adds Mixture of Experts (MoE) layers that route tokens through specialized experts. On a 10,373-sample test set, CNMBERT reaches a 61.53% MRR score and 51.86% accuracy, outperforming a fine-tuned Qwen baseline and GPT-4o; ablation studies attribute the gains to both the multi-mask strategy and the MoE layers. The approach may benefit downstream tasks such as named entity recognition and sentiment analysis. Future work includes reducing ambiguity for long or context-poor abbreviations and extending the method to other languages.

Abstract

The task of converting Hanyu Pinyin abbreviations to Chinese characters is a significant branch within the domain of Chinese Spelling Correction (CSC). It plays an important role in many downstream applications such as named entity recognition and sentiment analysis. This task typically involves text-length alignment and seems easy to solve; however, due to the limited information content in pinyin abbreviations, achieving accurate conversion is challenging. In this paper, we treat this as a fill-mask task and propose CNMBERT, which stands for zh-CN Pinyin Multi-mask BERT Model, as a solution to this issue. By introducing a multi-mask strategy and Mixture of Experts (MoE) layers, CNMBERT outperforms fine-tuned GPT models and ChatGPT-4o with a 61.53% MRR score and 51.86% accuracy on a 10,373-sample test dataset.
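
To make the two reported metrics concrete, here is a minimal sketch of how MRR and accuracy are commonly computed for a ranked fill-mask output. It rests on assumptions not stated in this summary: each test sample has exactly one gold answer, the model returns a ranked candidate list per sample, and "accuracy" means top-1 accuracy. The function names and the toy candidates are hypothetical and are not taken from the paper's code.

```python
from typing import List, Optional, Sequence


def reciprocal_rank(candidates: Sequence[str], target: str,
                    k: Optional[int] = None) -> float:
    """Return 1/rank of `target` in the ranked `candidates` (1-indexed).

    If `k` is given, only the first k candidates are considered (MRR@k).
    Returns 0.0 when the target is not found.
    """
    top = candidates if k is None else candidates[:k]
    for rank, cand in enumerate(top, start=1):
        if cand == target:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(predictions: List[Sequence[str]],
                         targets: List[str],
                         k: Optional[int] = None) -> float:
    """Average the reciprocal rank over all samples."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    total = sum(reciprocal_rank(p, t, k) for p, t in zip(predictions, targets))
    return total / len(targets)


def top1_accuracy(predictions: List[Sequence[str]],
                  targets: List[str]) -> float:
    """Fraction of samples whose first candidate equals the gold answer.

    Assumption: the summary does not say how accuracy is defined; top-1 is
    one common reading.
    """
    return mean_reciprocal_rank(predictions, targets, k=1)


# Toy data (hypothetical): ranked candidates for three abbreviations.
toy_predictions = [
    ["幻想乡", "好学校"],   # gold answer ranked 1st -> RR = 1
    ["没事", "美食", "秒杀"],  # gold answer ranked 2nd -> RR = 1/2
    ["你好"],               # gold answer missing   -> RR = 0
]
toy_targets = ["幻想乡", "美食", "牛"]

toy_mrr = mean_reciprocal_rank(toy_predictions, toy_targets)
toy_mrr_at_1 = mean_reciprocal_rank(toy_predictions, toy_targets, k=1)
toy_acc = top1_accuracy(toy_predictions, toy_targets)
```

Under these definitions MRR is never lower than top-1 accuracy, because a correct answer ranked below first place still earns partial credit. That ordering matches the reported pair of 61.53% MRR and 51.86% accuracy.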

Paper Structure

This paper contains 15 sections, 4 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: (a) A simple example of how a pinyin abbreviation can confuse the listener. The correct characters are highlighted in blue. In this context, “hxx” refers to “幻想乡 (Gensokyo)”. (b) An example of how some platforms designate words related to money and illness as prohibited words.
  • Figure 2: The overall architecture of the model and its workflow. The model uses a 16-layer transformer architecture (layers 0–15) in which some FFN layers are replaced with MoE layers, each containing a shared expert. Specifically, layers [1, 3, 5] have 2 experts with top-k = 1; layer [7] has 4 experts with top-k = 1; and layers [9, 11, 13, 15] have 8 experts with top-k = 2. All other layers use a regular FFN.
  • Figure 3: MRR@5 scores for predictions of words of different lengths.
  • Figure 4: Results for predicting the pinyin of monosyllabic and polysyllabic words.
  • Figure 5: The feature space of experts in layer 3 (a–c) and layer 7 (d–f).
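
The layer layout described in the Figure 2 caption can be written down as a small configuration table. The sketch below only encodes what the caption states (16 layers, which layers are MoE, expert counts, top-k, and the presence of a shared expert). The names `MoESpec` and `build_layer_layout` are hypothetical, and the caption does not say whether the shared expert is counted among the 2, 4, or 8 experts, so the sketch keeps it as a separate flag.

```python
from dataclasses import dataclass
from typing import Dict, Optional

NUM_LAYERS = 16  # layers 0-15, per the Figure 2 caption


@dataclass(frozen=True)
class MoESpec:
    """Settings for one MoE layer (hypothetical container, not the paper's code)."""
    num_experts: int
    top_k: int
    shared_expert: bool = True  # caption: each MoE layer contains a shared expert


def build_layer_layout() -> Dict[int, Optional[MoESpec]]:
    """Map each layer index to its MoE spec, or None for a regular FFN layer."""
    layout: Dict[int, Optional[MoESpec]] = {i: None for i in range(NUM_LAYERS)}
    for i in (1, 3, 5):
        layout[i] = MoESpec(num_experts=2, top_k=1)
    layout[7] = MoESpec(num_experts=4, top_k=1)
    for i in (9, 11, 13, 15):
        layout[i] = MoESpec(num_experts=8, top_k=2)
    return layout


layout = build_layer_layout()
moe_layers = sorted(i for i, spec in layout.items() if spec is not None)
ffn_layers = sorted(i for i, spec in layout.items() if spec is None)
```

Read this way, every odd-numbered layer is an MoE layer and every even-numbered layer keeps a regular FFN, with the expert count and top-k growing toward the upper layers.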