EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning

Ashish Seth, Ramaneswaran Selvakumar, S Sakshi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha

TL;DR

EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling) is a novel self-supervised learning approach for speech representation learning. It outperforms several state-of-the-art baselines on low-resource speech recognition and SUPERB benchmarks by 5%-10%.

Abstract

In this paper, we present EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling), a novel self-supervised learning approach for speech representation learning. In contrast to the prior methods that use random masking schemes for Masked Acoustic Modeling (MAM), we introduce a novel selective and adaptive masking strategy. Specifically, during SSL training, we progressively introduce harder regions to the model for reconstruction. Our approach automatically selects hard regions and is built on the observation that the reconstruction loss of individual frames in MAM can provide natural signals to judge the difficulty of solving the MAM pre-text task for that frame. To identify these hard regions, we employ a teacher model that first predicts the frame-wise losses and then decides which frames to mask. By learning to create challenging problems, such as identifying harder frames and solving them simultaneously, the model is able to learn more effective representations and thereby acquire a more comprehensive understanding of the speech. Quantitatively, EH-MAM outperforms several state-of-the-art baselines across various low-resource speech recognition and SUPERB benchmarks by 5%-10%. Additionally, we conduct a thorough analysis to show that the regions masked by EH-MAM effectively capture useful context across speech frames.

Paper Structure

This paper contains 26 sections, 6 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: EH-MAM compared to the random masking schemes widely employed in the literature. EH-MAM first identifies which frames to mask using a Teacher model and then solves the MAM task by reconstructing the selected masked regions using a Student model.
  • Figure 2: Increase in relative WER under selective and random masking schemes. During inference, under similar experimental settings, we selectively mask the frames with high reconstruction loss values and compare the result against random masking. The former consistently shows a significantly larger increase in relative WER than the latter, indicating that these frames capture more useful context for speech reconstruction. Building on this result, we hypothesize that asking a model to reconstruct these frames will result in stronger learning signals.
  • Figure 3: Illustration of the EH-MAM SSL algorithm. EH-MAM employs the self-distillation SSL framework, which consists of identical student and teacher networks. At each training iteration, the teacher is updated by the exponential moving average (EMA) of the student. ① For a speech input $Z$, we first use the teacher network to identify the speech frames that are hard to reconstruct, also called hard regions. To achieve this, we predict the frame-level reconstruction loss values $\mathcal{L}^t_p$ using a loss predictor $d_{\delta^t}$ by feeding $Z$ to the teacher network. ② Next, we use our easy-to-hard masking strategy to identify the mask indices $M^S$ associated with hard regions, and progressively introduce them alongside random mask indices $M^R$ over each epoch. ③ Finally, a masked variant $\tilde{Z}$ is fed to the student network, which is tasked to ④ reconstruct the masked regions by optimizing a reconstruction loss (as shown in Eq. \ref{eq:rec}) and ⑤ train a loss predictor $d_{\delta^s}$ by computing an auxiliary loss between the predicted and original reconstruction loss values, $\mathcal{L}^s_p$ and $\mathcal{L}^{rec}$ respectively (as shown in Eq. \ref{eq:aux}).
  • Figure 4: For a randomly chosen speech utterance, we show the variation in frame-level reconstruction loss values across training epochs. During the initial stages of EH-MAM pre-training, the model exhibits high frame-level reconstruction loss values, which results in low distinctiveness among individual values. This leads to increased stochasticity in the selective masking.
  • Figure 5: We compare the increase in relative Word Error Rate (WER) from selectively masking hard regions predicted by the loss predictor (EH-MAM masking) vs. randomly masking frames. The increase in relative WER indicates that the EH-MAM masking scheme masks useful context in an input.
  • ...and 1 more figure
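
The teacher update and mask selection described in the Figure 3 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`ema_update`, `selective_fraction`, `easy_to_hard_mask`), the EMA decay value, and in particular the linear schedule that grows the share of selectively masked frames over epochs are all hypothetical. The caption only states that hard-region indices $M^S$ are introduced progressively alongside random indices $M^R$.

```python
import random


def ema_update(teacher, student, decay=0.999):
    """Return teacher weights moved toward the student by an exponential moving average.

    Weights are plain lists of floats here; the decay value is an assumed default.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def selective_fraction(epoch, total_epochs):
    """Share of the mask budget given to hard frames at this epoch.

    ASSUMPTION: a linear ramp from 0 (first epoch) to 1 (last epoch).
    The actual schedule used by EH-MAM is not specified in this summary.
    """
    if total_epochs <= 1:
        return 1.0
    return min(1.0, max(0.0, epoch / (total_epochs - 1)))


def easy_to_hard_mask(predicted_losses, mask_ratio, epoch, total_epochs, rng=None):
    """Split the mask budget between hard frames (M^S) and random frames (M^R).

    predicted_losses: per-frame loss values predicted by the teacher's loss predictor.
    Returns two disjoint sets of frame indices: (selective, random_part).
    """
    if rng is None:
        rng = random.Random(0)
    num_frames = len(predicted_losses)
    num_masked = int(round(mask_ratio * num_frames))
    num_selective = int(round(selective_fraction(epoch, total_epochs) * num_masked))

    # Rank frames from hardest (highest predicted loss) to easiest.
    ranked = sorted(range(num_frames), key=lambda i: predicted_losses[i], reverse=True)
    selective = set(ranked[:num_selective])

    # Fill the rest of the budget with uniformly sampled frames not already chosen.
    remaining = [i for i in range(num_frames) if i not in selective]
    random_part = set(rng.sample(remaining, num_masked - num_selective))
    return selective, random_part
```

Under this assumed schedule, early epochs mask almost entirely at random (the easy setting), and later epochs devote more of the fixed mask budget to the frames with the highest predicted loss. Calling `easy_to_hard_mask(losses, 0.4, epoch, total_epochs)` once per utterance yields the two index sets whose union would be masked before the input is fed to the student.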