Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition
Zhengdong Yang, Qianying Liu, Sheng Li, Fei Cheng, Chenhui Chu
TL;DR
This work improves the decoding stage of low-resource multilingual ASR by replacing the Huffman-based vocabulary tree of hierarchical Softmax (H-Softmax) with a tree built from cross-lingual embeddings. It evaluates two embedding strategies, pre-trained cross-lingual models (XLM, LaBSE) and Mono-Map mappings, combined with several hierarchical clustering methods to form a more linguistically faithful vocabulary tree. Across 15 languages from the Romance, Slavic, and Turkic families, the embedding-based H-Softmax consistently outperforms both the Huffman-based baseline and the standard Softmax, with larger pre-trained models offering the greatest gains in diverse language settings. The results indicate stronger cross-lingual token sharing and improved language-agnostic decoding, and they offer insights into tree structure and language identification in multilingual ASR, suggesting practical value for extending ASR to many low-resource languages.
Abstract
We present a novel approach centered on the decoding stage of Automatic Speech Recognition (ASR) that enhances multilingual performance, especially for low-resource languages. The approach uses a cross-lingual embedding clustering method to construct a hierarchical Softmax (H-Softmax) decoder, so that similar tokens across different languages share similar decoder representations. This addresses the limitations of the previous Huffman-based H-Softmax method, which relied on shallow features to assess token similarity. Through experiments on a downsampled dataset of 15 languages, we demonstrate the effectiveness of our approach in improving low-resource multilingual ASR accuracy.
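To make the idea concrete, the following is a minimal sketch of how a vocabulary tree might be built by clustering token embeddings and then used for H-Softmax. It is not the authors' implementation. The token names, the 2-D embedding values, the centroid-linkage clustering, and the randomly initialized node weights are all assumptions chosen for illustration; the paper's embeddings would come from XLM, LaBSE, or Mono-Map, and its clustering method may differ.

```python
import math
import random

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def build_tree(embeddings):
    """Greedy agglomerative clustering (centroid linkage, an assumption).
    Leaves are token strings; internal nodes are (left, right) tuples."""
    clusters = [(tok, [vec]) for tok, vec in embeddings.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cosine_distance(mean(clusters[i][1]), mean(clusters[j][1]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

def leaf_paths(tree, prefix=()):
    """Map each token to its root-to-leaf path of branch choices (0 = left, 1 = right)."""
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    paths = leaf_paths(left, prefix + (0,))
    paths.update(leaf_paths(right, prefix + (1,)))
    return paths

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def token_probability(path, hidden, node_weights):
    """H-Softmax: multiply the binary branch probabilities along the path."""
    prob = 1.0
    for depth, branch in enumerate(path):
        weights = node_weights[path[:depth]]  # internal node keyed by its path prefix
        p_right = sigmoid(sum(w * h for w, h in zip(weights, hidden)))
        prob *= p_right if branch == 1 else 1.0 - p_right
    return prob

# Hypothetical toy embeddings: two Romance-like and two Slavic-like tokens.
embeddings = {
    "es_a": [1.0, 0.1],
    "pt_a": [0.9, 0.2],
    "ru_ya": [0.1, 1.0],
    "uk_ya": [0.3, 0.9],
}
tree = build_tree(embeddings)
paths = leaf_paths(tree)

# Randomly initialized node parameters stand in for trained decoder weights.
rng = random.Random(0)
internal_nodes = {p[:d] for p in paths.values() for d in range(len(p))}
node_weights = {node: [rng.gauss(0, 1) for _ in range(2)] for node in internal_nodes}

hidden = [0.5, -0.3]  # stand-in for a decoder hidden state
probs = {tok: token_probability(path, hidden, node_weights) for tok, path in paths.items()}
```

In this sketch, tokens with nearby embeddings end up as siblings in the tree, so they share every internal-node decision except the last one. That sharing is the mechanism by which similar tokens across languages can receive similar decoder representations. Because every internal node splits its probability mass between exactly two children, the leaf probabilities form a valid distribution over the vocabulary.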
