ValSub: Subsampling Validation Data to Mitigate Forgetting during ASR Personalization

Haaris Mehmood, Karthikeyan Saravanan, Pablo Peso Parada, David Tuckey, Mete Ozay, Gil Ho Lee, Jungin Lee, Seokyeong Jung

TL;DR

We propose a novel method for subsampling a large validation set into a much smaller one that can still estimate forgetting, and show that the resulting subset mitigates forgetting when used to dynamically determine the ideal number of fine-tuning epochs.

Abstract

Automatic Speech Recognition (ASR) is widely used in consumer devices such as mobile phones. Recent work on personalization, i.e. on-device model fine-tuning, has shown that adapting ASR models to a target user's speech improves their performance on rare words or accented speech. Despite these gains, fine-tuning on user data (target domain) risks catastrophic forgetting: the personalized model loses knowledge of its original training distribution (source domain), which leads to subpar general ASR performance. A simple and efficient way to combat catastrophic forgetting is to measure it on a validation set that represents the source domain distribution. However, such validation sets are large and impractical for mobile devices. To address this, we propose a novel method to subsample a large validation set into a much smaller one while maintaining the ability to estimate forgetting. We demonstrate the efficacy of the subsampled set in mitigating forgetting by using it to dynamically determine the ideal number of fine-tuning epochs. When measuring the deviation in per-user fine-tuning epochs against a 50x larger validation set (oracle), our method achieves a lower mean absolute error (3.39) than randomly selected subsets of the same size (3.78-8.65). Unlike the random baselines, our method consistently tracks the oracle's behaviour across three different forgetting thresholds.
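
The evaluation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that forgetting is reported as a percentage after each fine-tuning epoch, that training stops at the last epoch whose forgetting stays at or below the threshold, and that the per-user epoch counts from a small subset are compared with the oracle's counts by mean absolute error. The names `epochs_before_forgetting`, `forgetting_after_epoch`, and `mean_absolute_error` are hypothetical and do not come from the paper.

```python
from typing import Callable, Sequence


def epochs_before_forgetting(
    forgetting_after_epoch: Callable[[int], float],
    threshold: float,
    max_epochs: int,
) -> int:
    """Return the number of epochs to keep under a forgetting budget.

    `forgetting_after_epoch(e)` is a hypothetical callback that returns the
    forgetting (in percent) measured on a validation set after epoch `e`.
    Assumption: the last epoch whose forgetting is <= threshold is kept.
    """
    for epoch in range(1, max_epochs + 1):
        if forgetting_after_epoch(epoch) > threshold:
            return epoch - 1  # roll back to the last epoch within budget
    return max_epochs  # the threshold was never exceeded


def mean_absolute_error(epochs: Sequence[int], oracle_epochs: Sequence[int]) -> float:
    """Mean absolute deviation between per-user epoch counts and the oracle's."""
    if len(epochs) != len(oracle_epochs) or len(epochs) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(a - b) for a, b in zip(epochs, oracle_epochs))
    return total / len(epochs)


if __name__ == "__main__":
    # Toy forgetting curves for two made-up users (illustrative values only).
    oracle_curves = [lambda e: 0.4 * e, lambda e: 0.2 * e]
    subset_curves = [lambda e: 0.5 * e, lambda e: 0.3 * e]
    threshold, max_epochs = 1.5, 10

    oracle = [epochs_before_forgetting(c, threshold, max_epochs) for c in oracle_curves]
    subset = [epochs_before_forgetting(c, threshold, max_epochs) for c in subset_curves]
    print("oracle epochs:", oracle)
    print("subset epochs:", subset)
    print("MAE vs oracle:", mean_absolute_error(subset, oracle))
```

A lower error means the small validation set stops fine-tuning at nearly the same epoch as the full set would, which is the property the abstract reports for the proposed subsampling.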

Paper Structure

This paper contains 16 sections, 1 equation, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: A high-level overview of our proposed method: Importance-based Subsampling of Validation Data.
  • Figure 2: Forgetting histogram and density estimation after evaluating $\mathcal{V}$ on fine-tuned models of $U=72$ pseudo-users for $K=4$ hyper-parameters totalling 288 runs; 15 outlier runs ($<-6\%$ WERR) were removed. The mean (standard deviation) is 1.6 (1.7).
  • Figure 3: An I-F plot for mean values of $V=42$ users using a target threshold of $\Gamma=1.5\%$ for forgetting. Error bars denoting standard deviation are only shown for Oracle, Rand#1 and Canb for brevity. Our proposed method using Canb is closest to the oracle.
  • Figure 4: Box-and-whisker plot of epochs trained for the oracle, the baselines and the proposed method. The random baselines are skewed towards either the minimum or the maximum number of epochs. The all-correct baseline maintains a static, uniform-like distribution. The proposed method closely follows the oracle across three thresholds.
  • Figure : $\textsc{FineTune}(\theta, \mathcal{D}, \mathcal{E}, \Gamma, \Omega)$ with early stopping.