
Trade-offs Between Capacity and Robustness in Neural Audio Codecs for Adversarially Robust Speech Recognition

Jordan Prescott, Thanathai Lertpetchpun, Shrikanth Narayanan

TL;DR

Adversarially induced changes in discrete codebook tokens strongly correlate with transcription error. The robustness gains from neural codec configurations persist under adaptive attacks, where they outperform traditional compression defenses.

Abstract

Adversarial perturbations exploit vulnerabilities in automatic speech recognition (ASR) systems while preserving human-perceived linguistic content. Neural audio codecs impose a discrete bottleneck that can suppress fine-grained signal variations associated with adversarial noise. We examine how the granularity of this bottleneck, controlled by residual vector quantization (RVQ) depth, shapes adversarial robustness. We observe a non-monotonic trade-off under gradient-based attacks: shallow quantization suppresses adversarial perturbations but degrades speech content, while deeper quantization preserves both content and perturbations. Intermediate depths balance these effects and minimize transcription error. We further show that adversarially induced changes in discrete codebook tokens strongly correlate with transcription error. These gains persist under adaptive attacks, where neural codec configurations outperform traditional compression defenses.
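
To make the role of RVQ depth concrete, the following is a minimal sketch of residual vector quantization on a single latent frame. It is not the implementation used by DAC, EnCodec, or Mimi: those codecs learn their codebooks and quantize encoder latents, whereas here the codebooks are random toy values and the names `make_toy_codebooks`, `rvq_encode`, and `frame` are hypothetical. The sketch only illustrates that each additional stage quantizes the residual left by the previous one, so a larger depth $N$ retains finer detail of the input.

```python
import random


def make_toy_codebooks(num_stages, codebook_size, dim, seed=0):
    """Build random toy codebooks (an assumption; real codecs learn these).

    Later stages use a smaller scale, mimicking the fact that residuals
    shrink as more stages are applied.
    """
    rng = random.Random(seed)
    codebooks = []
    for stage in range(num_stages):
        scale = 0.5 ** stage
        codebooks.append(
            [[rng.uniform(-scale, scale) for _ in range(dim)]
             for _ in range(codebook_size)]
        )
    return codebooks


def nearest(codebook, vec):
    """Return (index, codeword) of the entry closest to vec in squared L2 distance."""
    best = min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )
    return best, codebook[best]


def rvq_encode(vec, codebooks, depth):
    """Quantize vec with the first `depth` codebooks.

    Each stage quantizes the residual left by the previous stages.
    Returns the token indices and the reconstructed vector.
    """
    residual = list(vec)
    recon = [0.0] * len(vec)
    tokens = []
    for codebook in codebooks[:depth]:
        idx, code = nearest(codebook, residual)
        tokens.append(idx)
        recon = [r + c for r, c in zip(recon, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return tokens, recon


codebooks = make_toy_codebooks(num_stages=8, codebook_size=16, dim=4)
frame = [0.3, -0.2, 0.5, 0.1]  # hypothetical latent frame

for depth in (1, 4, 8):
    tokens, recon = rvq_encode(frame, codebooks, depth)
    err = sum((a - b) ** 2 for a, b in zip(frame, recon)) ** 0.5
    print(f"depth={depth} tokens={tokens} l2_error={err:.4f}")
```

Because the encoding is greedy, the tokens at depth $N$ are a prefix of the tokens at any larger depth. Truncating the depth therefore discards only the finest residual stages, which is the capacity knob the abstract refers to.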

Paper Structure (17 sections, 2 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Codec-based inference-time transformation for ASR. Neural audio codecs impose a discrete RVQ bottleneck on adversarial inputs. PGD ignores the codec during optimization, while BPDA+EOT adapts by approximating codec gradients.
  • Figure 2: CCR (left) and WER (right) versus RVQ depth $N$ under PGD for DAC (top), EnCodec (middle), and Mimi (bottom), evaluated using Whisper. For $\epsilon>0$, CCR increases monotonically with depth while WER exhibits a non-monotonic dependence with a minimum at intermediate $N$. The clean baseline ($\epsilon=0$) is shown for WER only, as CCR is zero and omitted. A subset of depths is shown for clarity.
  • Figure 3: $\Delta$WER versus CCR under PGD for Wav2Vec 2.0 and Whisper across DAC, EnCodec, and Mimi. Each point represents a depth--budget configuration, with point size indicating RVQ depth $N$ and color denoting perturbation strength $\epsilon$. $\Delta$WER is measured relative to the clean ($\epsilon=0$) baseline. Spearman rank correlations are averaged over all $\epsilon$, with a subset of depths shown for clarity.
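
The analysis behind Figure 3 can be sketched in a few lines. CCR is not defined in this section, so the sketch assumes it is the fraction of codebook tokens that differ between the clean and adversarial encodings of the same utterance; the function names and all numeric values below are hypothetical placeholders, not results from the paper. The Spearman rank correlation is computed as the Pearson correlation of average ranks, which handles ties.

```python
def token_change_rate(clean_tokens, adv_tokens):
    """Fraction of positions whose token differs (assumed reading of CCR)."""
    if len(clean_tokens) != len(adv_tokens):
        raise ValueError("token sequences must have equal length")
    changed = sum(c != a for c, a in zip(clean_tokens, adv_tokens))
    return changed / len(clean_tokens)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Placeholder values, one per depth-budget configuration (not measured data).
ccr_values = [0.05, 0.12, 0.20, 0.31, 0.44]
delta_wer_values = [0.01, 0.04, 0.03, 0.10, 0.18]

print(token_change_rate([3, 7, 7, 1], [3, 7, 2, 1]))
print(spearman(ccr_values, delta_wer_values))
```

In this reading, each point in Figure 3 would contribute one (CCR, $\Delta$WER) pair, and a rank correlation near 1 would indicate that configurations with more token changes also tend to show larger increases in WER.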