SSHR: Leveraging Self-supervised Hierarchical Representations for Multilingual Automatic Speech Recognition

Hongfei Xue, Qijie Shao, Kaixun Huang, Peikun Chen, Jie Liu, Lei Xie

TL;DR

Multilingual ASR remains challenging due to diverse linguistic and acoustic conditions. This work introduces SSHR, a method that leverages self-supervised hierarchical representations in MMS by extracting a language-related frame from middle layers and enhancing final-layer content through Cross-CTC and SSL-parameter refinements, with a combined loss $L_{all} = (1 - w) L_{ctc} + w \frac{1}{k} \sum L_{ctc}^{j}$. Through layer-wise analysis and extensive experiments on Common Voice and ML-SUPERB, SSHR achieves state-of-the-art results and consistent relative improvements over direct fine-tuning. The contributions include a principled way to utilize middle-layer language information, a targeted mechanism to reinforce final-layer content, and empirical demonstrations of their effectiveness in low-resource multilingual ASR. This approach highlights the value of harnessing SSL hierarchies to improve cross-language transcription and offers practical strategies for improving downstream multilingual ASR systems.
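The combined loss above can be sketched as a small helper. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `combined_ctc_loss` and the sample loss values are hypothetical, `k` is read as the number of intermediate layers that carry a CTC loss, and `w` as the weight on their average.

```python
def combined_ctc_loss(final_ctc_loss, inter_ctc_losses, w):
    """Return (1 - w) * L_ctc + w * (1 / k) * sum_j L_ctc^j.

    final_ctc_loss   -- CTC loss of the final layer (L_ctc)
    inter_ctc_losses -- CTC losses of the k intermediate layers (L_ctc^j)
    w                -- weight on the averaged intermediate losses
    """
    k = len(inter_ctc_losses)
    if k == 0:
        raise ValueError("at least one intermediate CTC loss is required")
    # Average the intermediate-layer losses: (1 / k) * sum_j L_ctc^j
    inter_mean = sum(inter_ctc_losses) / k
    # Interpolate between the final-layer loss and the intermediate average
    return (1.0 - w) * final_ctc_loss + w * inter_mean


# Hypothetical loss values, for illustration only
loss = combined_ctc_loss(2.0, [1.0, 3.0], w=0.5)
print(loss)
```

With `w = 0` the expression reduces to the final-layer CTC loss alone; larger values of `w` shift weight toward the intermediate layers.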

Abstract

Multilingual automatic speech recognition (ASR) systems have garnered attention for their potential to extend language coverage globally. While self-supervised learning (SSL) models, like MMS, have demonstrated their effectiveness in multilingual ASR, it is worth noting that various layers' representations potentially contain distinct information that has not been fully leveraged. In this study, we propose a novel method that leverages self-supervised hierarchical representations (SSHR) to fine-tune the MMS model. We first analyze the different layers of MMS and show that the middle layers capture language-related information, and the high layers encode content-related information, which gradually decreases in the final layers. Then, we extract a language-related frame from correlated middle layers and guide specific language extraction through self-attention mechanisms. Additionally, we steer the model toward acquiring more content-related information in the final layers using our proposed Cross-CTC. We evaluate SSHR on two multilingual datasets, Common Voice and ML-SUPERB, and the experimental results demonstrate that our method achieves state-of-the-art performance.

Paper Structure (17 sections, 3 equations, 2 figures, 4 tables)


Figures (2)

  • Figure 1: The overall framework of the proposed SSHR. The Inter.CTC Loss is the CTC loss of the $j$-th layer, while the Inter.Prob is the posterior probabilities of the Inter.CTC Loss.
  • Figure 2: ACC with LID labels and MI with phoneme labels.
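
The MI measure in Figure 2 can be illustrated with a small sketch. The text here does not say how MI is estimated, so this assumes one common setup: each frame's representation is first mapped to a discrete cluster id (for example by k-means, not shown), and MI is then computed between those ids and frame-level phoneme labels. The function name `mutual_information` and the toy labels are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two label sequences."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of (x, y) pairs
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi


# Toy example: cluster ids that match the phoneme labels exactly
cluster_ids = [0, 0, 1, 1]
phonemes = ["a", "a", "b", "b"]
print(mutual_information(cluster_ids, phonemes))
```

A higher value means the cluster ids derived from a layer are more predictive of the phoneme labels, which is one way to read "content-related information" for that layer.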