Polyphone Disambiguation in Mandarin Chinese with Semi-Supervised Learning

Yi Shi, Congyi Wang, Yu Chen, Bin Wang

TL;DR

This work tackles Mandarin polyphone disambiguation for text-to-speech by introducing Semi-PPL, a semi-supervised learning framework that uses large-scale unlabeled text to improve grapheme-to-phoneme (G2P) disambiguation. The approach uses a compact base model built on tiny-Electra and a Conv-BLSTM classifier, enhanced with consistency regularization across text augmentations and entropy-driven pseudo labeling, including dictionary-assisted labeling for monophonic words. Experiments on a dedicated dataset (326K labeled training sentences, 1,100 challenging test sentences, and 25.4M unlabeled lines) show state-of-the-art accuracy with substantially lower model complexity than BERT-based methods. The authors also publish a sizable labeled benchmark to promote further research and practical deployment in Mandarin G2P systems.

Abstract

The majority of Chinese characters are monophonic, while a special group of characters, called polyphonic characters, have multiple pronunciations. As a prerequisite for speech-related generative tasks, the correct pronunciation must be identified among several candidates. This process is called polyphone disambiguation. Although the problem has been well explored with both knowledge-based and learning-based approaches, it remains challenging due to the lack of publicly available labeled datasets and the irregular nature of polyphones in Mandarin Chinese. In this paper, we propose a novel semi-supervised learning (SSL) framework for Mandarin Chinese polyphone disambiguation that can potentially leverage unlimited unlabeled text data. We explore the effect of various proxy labeling strategies, including entropy thresholding and lexicon-based labeling. Qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art performance. In addition, we publish a novel dataset specifically for the polyphone disambiguation task to promote further research.
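To make the entropy-thresholding idea concrete, here is a minimal sketch of how a proxy label might be assigned to an unlabeled polyphonic character. It is an illustration under stated assumptions, not the authors' implementation: the function names `entropy` and `pseudo_label`, the threshold value, and the example probabilities are all hypothetical. The sketch assumes the classifier outputs a probability distribution over a character's candidate pronunciations and keeps the top prediction only when the distribution's entropy is low, that is, when the model is confident.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pseudo_label(probs, threshold):
    """Return the index of the most likely pronunciation if the prediction
    entropy is below `threshold`; otherwise return None (sample is skipped).

    `threshold` is a hypothetical hyperparameter chosen for illustration.
    """
    if entropy(probs) < threshold:
        return max(range(len(probs)), key=probs.__getitem__)
    return None


# Hypothetical model outputs for one polyphonic character with two
# candidate pronunciations (the numbers are made up for illustration).
confident = [0.97, 0.03]
uncertain = [0.55, 0.45]

label_confident = pseudo_label(confident, threshold=0.3)
label_uncertain = pseudo_label(uncertain, threshold=0.3)
```

With these made-up numbers, the confident prediction is accepted as a proxy label for the first candidate, while the uncertain one is rejected and would not contribute to training on unlabeled text.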

Paper Structure

This paper contains 16 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: A classic case of pronunciation ambiguity caused by a polyphonic character. The above examples are selected from our evaluation set.
  • Figure 2: The network architecture of our proposed approach. The characters in the blue boxes are polyphonic characters.
  • Figure 3: The framework of semi-supervised learning on texts containing polyphonic characters. The upper part represents the training pipeline for labeled data. The lower part describes the procedure for processing unlabeled text data.