Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT
Dongyang Dai, Zhiyong Wu, Shiyin Kang, Xixin Wu, Jia Jia, Dan Su, Dong Yu, Helen Meng
TL;DR
The paper tackles Mandarin polyphone disambiguation within G2P for TTS by proposing an end-to-end framework that combines a pre-trained BERT encoder with a neural-network classifier. It extracts semantic features directly from raw Chinese character sequences, so no preprocessing is needed, and uses unshared output layers, one per polyphonic character, to predict pronunciation from context. Three classifier variants (fully-connected, BLSTM, and Transformer block) are evaluated. All of them benefit from BERT features and contextual information, with the BLSTM variant performing best among them. Experiments on a Tencent AI Lab dataset show significant improvements over a strong LSTM baseline, and analyses including attention visualization and PCA of embeddings corroborate the importance of nearby context. The approach also scales to new polyphonic characters by adding corresponding output layers without retraining existing ones, which eases practical deployment for Mandarin G2P in TTS.
Abstract
Grapheme-to-phoneme (G2P) conversion is an essential component of Mandarin Chinese text-to-speech (TTS) systems, where polyphone disambiguation is the core issue. In this paper, we propose an end-to-end framework to predict the pronunciation of a polyphonic character. The framework accepts a sentence containing the polyphonic character as input, in the form of a Chinese character sequence, without requiring any preprocessing. The proposed method consists of a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model and a neural network (NN) based classifier. The pre-trained BERT model extracts semantic features from the raw Chinese character sequence, and the NN-based classifier predicts the polyphonic character's pronunciation from the BERT output. In our experiments, we implemented three classifiers: a fully-connected network based classifier, a long short-term memory (LSTM) network based classifier, and a Transformer block based classifier. Compared with an LSTM-based baseline approach, the experimental results demonstrate that the pre-trained model extracts effective semantic features, which greatly enhance the performance of polyphone disambiguation. In addition, we explored the impact of contextual information on polyphone disambiguation.
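To make the idea of one output layer per polyphonic character concrete, the following is a minimal sketch in plain Python, not the authors' implementation. Everything named here is an assumption for illustration: the class `UnsharedOutputLayers`, the tiny `FEATURE_DIM`, the pinyin label sets, and the random untrained weights. A constant placeholder vector stands in for the encoder feature at the polyphonic character's position, where a real system would use the BERT output.

```python
import math
import random

# Illustrative label sets (assumed): each polyphonic character has its own
# candidate pronunciations, written as pinyin with tone numbers.
PRONUNCIATIONS = {
    "行": ["xing2", "hang2"],
    "长": ["chang2", "zhang3"],
}

FEATURE_DIM = 8  # toy stand-in for the encoder's hidden size


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class UnsharedOutputLayers:
    """Hypothetical helper: one independent linear layer per character."""

    def __init__(self, feature_dim, seed=0):
        self.feature_dim = feature_dim
        self.rng = random.Random(seed)
        self.layers = {}  # char -> (labels, weight rows)

    def add_character(self, char, labels):
        # Creates fresh weights for the new character only;
        # layers of previously added characters are left untouched.
        weights = [
            [self.rng.uniform(-0.1, 0.1) for _ in range(self.feature_dim)]
            for _ in labels
        ]
        self.layers[char] = (list(labels), weights)

    def predict(self, char, feature):
        # Select the layer belonging to this character, score its labels,
        # and return the most probable one with the full distribution.
        labels, weights = self.layers[char]
        scores = [sum(w * x for w, x in zip(row, feature)) for row in weights]
        probs = softmax(scores)
        best = max(range(len(labels)), key=probs.__getitem__)
        return labels[best], dict(zip(labels, probs))


head = UnsharedOutputLayers(FEATURE_DIM)
for char, labels in PRONUNCIATIONS.items():
    head.add_character(char, labels)

# Placeholder for the encoder feature at the polyphonic character's position.
feature = [0.5] * FEATURE_DIM
label, probs = head.predict("行", feature)
print(label, probs)
```

Because each character owns its weights, supporting a new polyphonic character in this sketch only requires another `add_character` call. The predicted label is arbitrary here, since the weights are untrained.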
