
Training Large ASR Encoders with Differential Privacy

Geeticka Chauhan, Steve Chien, Om Thakkar, Abhradeep Thakurta, Arun Narayanan

TL;DR

This paper is the first to apply DP to SSL for ASR. It investigates the DP noise tolerance of the BEST-RQ pre-training method and introduces a novel variant of model pruning, gradient-based layer freezing, that provides strong improvements in privacy-utility-compute trade-offs.

Abstract

Self-supervised learning (SSL) methods for large speech models have proven to be highly effective at ASR. With the interest in public deployment of large pre-trained models, there is a rising concern for unintended memorization and leakage of sensitive data points from the training data. In this paper, we apply differentially private (DP) pre-training to a SOTA Conformer-based encoder, and study its performance on a downstream ASR task assuming the fine-tuning data is public. This paper is the first to apply DP to SSL for ASR, investigating the DP noise tolerance of the BEST-RQ pre-training method. Notably, we introduce a novel variant of model pruning called gradient-based layer freezing that provides strong improvements in privacy-utility-compute trade-offs. Our approach yields a LibriSpeech test-clean/other WER (%) of 3.78/8.41 with (10, 1e-9)-DP for extrapolation towards low dataset scales, and 2.81/5.89 with (10, 7.9e-11)-DP for extrapolation towards high scales.

Paper Structure

This paper contains 18 sections, 1 equation, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: The differentially private pre-training method for the ASR encoder, which clips per-example gradients from the minibatch and adds calibrated Gaussian noise. Gradients with norms below the clip value are not clipped, as shown above. Once private pre-training of the ASR encoder is done, fine-tuning is done publicly after attaching an ASR decoder and using CTC loss (Gulati et al., 2020; Graves et al., 2006).
  • Figure 2: Extrapolating the noise multiplier linearly with batch size and dataset size to maintain the signal-to-noise ratio and improve privacy accounting.
  • Figure 3: Performance from tuning our LayerFreeze method with different percentages of parameters frozen, while keeping the DP noise multiplier constant at $10^{-3}$. Along the x-axis, $p$ denotes the percentage of parameters belonging to the layers with the highest accumulated gradient norms. We run experiments freezing either those $p$% of parameters or the remaining $(1-p)$%. To save compute, fine-tuning uses an early pre-training checkpoint of 200k, assuming that the same conclusions hold for 1M.
  • Figure 4: Most extreme setting: scaling up the noise multiplier linearly with batch size and other independent parameters to maintain the signal-to-noise ratio. All other training dynamics remain unchanged, under the assumption that utility stays the same.
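
The clip-then-add-noise step described in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `clip_gradient` and `dp_mean_gradient` are hypothetical, gradients are flat lists of floats, and the noise standard deviation is taken to be `noise_multiplier * clip_norm`, which is the common DP-SGD convention and may differ from the exact parameterization used in the paper.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Rescale grad so its L2 norm is at most clip_norm.

    Gradients whose norm is already below clip_norm are returned unchanged.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]


def dp_mean_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average.

    Assumption: noise std = noise_multiplier * clip_norm (usual DP-SGD convention).
    """
    dim = len(per_example_grads[0])
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(per_example_grads) for x in noisy]


if __name__ == "__main__":
    rng = random.Random(0)
    # First gradient has norm 5 and is clipped; the second has norm 0.5 and is not.
    grads = [[3.0, 4.0], [0.3, 0.4]]
    print(dp_mean_gradient(grads, clip_norm=1.0, noise_multiplier=1e-3, rng=rng))
```

Because the noise is added once to the summed gradient, its relative size shrinks as the batch grows, which is the intuition behind scaling the noise multiplier with batch size in Figures 2 and 4.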

Theorems & Definitions (1)

  • Definition 1: $(\epsilon, \delta)$-DP
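
For reference, the standard textbook statement of $(\epsilon, \delta)$-differential privacy is given below. The paper's own wording of Definition 1 is not reproduced on this page, so the symbols $\mathcal{A}$, $D$, $D'$, and $S$, as well as the notion of neighboring datasets (differing in one training example), are assumptions based on the usual formulation.

```latex
% A randomized algorithm \mathcal{A} is (\epsilon, \delta)-DP if, for all
% neighboring datasets D and D' and all measurable output sets S:
\Pr[\mathcal{A}(D) \in S] \;\le\; e^{\epsilon} \, \Pr[\mathcal{A}(D') \in S] + \delta
```

Smaller $\epsilon$ and $\delta$ mean a stronger guarantee; the abstract's $(10, 10^{-9})$-DP result, for instance, corresponds to $\epsilon = 10$ and $\delta = 10^{-9}$.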