Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition

Zhengdong Yang, Qianying Liu, Sheng Li, Fei Cheng, Chenhui Chu

TL;DR

This work targets the decoding stage of low-resource multilingual ASR, replacing the Huffman-based vocabulary tree with a hierarchical Softmax (H-Softmax) tree built by clustering cross-lingual embeddings. It evaluates two embedding strategies, pre-trained cross-lingual models (XLM, LaBSE) and Mono-Map mappings, combined with several hierarchical clustering methods, to form a more linguistically faithful vocabulary tree. Across 15 languages from the Romance, Slavic, and Turkic families, the embedding-based H-Softmax consistently outperforms both the Huffman-based baseline and the standard Softmax, with larger pre-trained models giving the greatest gains in diverse language settings. The results indicate stronger cross-lingual token sharing and more language-agnostic decoding, and they offer insight into the role of tree structure and language identification in multilingual ASR, which is relevant to extending ASR to more low-resource languages.

Abstract

We present a novel approach centered on the decoding stage of Automatic Speech Recognition (ASR) that enhances multilingual performance, especially for low-resource languages. It utilizes a cross-lingual embedding clustering method to construct a hierarchical Softmax (H-Softmax) decoder, which enables similar tokens across different languages to share similar decoder representations. It addresses the limitations of the previous Huffman-based H-Softmax method, which relied on shallow features in token similarity assessments. Through experiments on a downsampled dataset of 15 languages, we demonstrate the effectiveness of our approach in improving low-resource multilingual ASR accuracy.
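
The core idea of the abstract, turning token embeddings into a vocabulary tree, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the four tokens and their 2-D vectors are hypothetical, and cosine distance with average linkage is just one possible choice of hierarchical clustering, not necessarily the one used in the paper.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def leaves(tree):
    """Return the tokens under a (sub)tree; a leaf is a plain string."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

def build_tree(embeddings):
    """Average-linkage agglomerative clustering into a binary tree of nested tuples."""
    clusters = list(embeddings)  # start with one cluster per token

    def linkage(pair):
        # Mean pairwise distance between the tokens of two clusters.
        i, j = pair
        dists = [cosine_distance(embeddings[x], embeddings[y])
                 for x in leaves(clusters[i]) for y in leaves(clusters[j])]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Merge the two closest clusters into a new internal node.
        i, j = min(combinations(range(len(clusters)), 2), key=linkage)
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]

# Hypothetical cross-lingual embeddings (illustrative values only).
toy_embeddings = {
    "es:casa": (1.0, 0.1),
    "it:casa": (0.9, 0.2),
    "ru:dom": (0.1, 1.0),
    "uk:dim": (0.3, 0.9),
}

tree = build_tree(toy_embeddings)
print(tree)
```

In this toy case, tokens whose embeddings are close across languages end up as siblings, so they would share most of their path through an H-Softmax decoder built on the tree.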

Paper Structure

This paper contains 24 sections, 2 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: The flowchart of our framework for ASR with H-Softmax. The blue line represents how the H-Softmax network is determined, and the red line represents how the ASR model is trained. The green area at the bottom shows the detail of the proposed hierarchical clustering of cross-lingual embeddings.
  • Figure 2: A typical H-Softmax tree structure. Leaf $w_3$ is given a virtual child with the same probability so that every leaf is aligned to the same depth, which makes path vectorization possible.
  • Figure 3: CER% on Catalan and Ukrainian with different proportions of languages from the same group in the training data. The composition of each training set is shown in the box at the bottom-left corner.
  • Figure 4: Correlation plots of CER% and Bilingual Lexicon Induction (BLI) p@1 on Romance languages for LaBSE, Mono-Map, and XLM-base. The X-axis represents the ASR performance (CER%) of each language. The Y-axis represents the BLI p@1 score with that language as the source language. Each blue dot represents one language. The red line is the linear regression fit to the data points across languages.
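
The tree structure described in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the three-leaf tree, the node logits, and the `"L"`/`"R"` node naming are hypothetical, and the virtual child is interpreted here as an extra branch factor of 1.0, which leaves the leaf's probability unchanged while giving every path the same length.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def path_factors(tree, logits, prefix=""):
    """Map each leaf to the list of branch probabilities along its root-to-leaf path."""
    if isinstance(tree, str):
        return {tree: []}
    p_left = sigmoid(logits[prefix])  # probability of taking the left branch here
    left, right = tree
    out = {}
    for leaf, factors in path_factors(left, logits, prefix + "L").items():
        out[leaf] = [p_left] + factors
    for leaf, factors in path_factors(right, logits, prefix + "R").items():
        out[leaf] = [1.0 - p_left] + factors
    return out

# Hypothetical tree: w3 sits one level above w1 and w2.
tree = (("w1", "w2"), "w3")
# Hypothetical logits per internal node; in a real model these would be
# produced from the decoder state rather than fixed constants.
logits = {"": 0.4, "L": -1.2}

factors = path_factors(tree, logits)
max_depth = max(len(f) for f in factors.values())

# Pad shallow leaves with virtual steps of probability 1.0 so that all
# paths have equal length and can be stacked into one fixed-size array.
padded = {leaf: f + [1.0] * (max_depth - len(f)) for leaf, f in factors.items()}

# Leaf probability = product of the branch probabilities along its path.
probs = {leaf: math.prod(f) for leaf, f in padded.items()}
print(probs)
```

Because each internal node splits its probability mass between its two children, the leaf probabilities sum to 1 without a normalization over the whole vocabulary.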