Trade-offs Between Capacity and Robustness in Neural Audio Codecs for Adversarially Robust Speech Recognition

Jordan Prescott; Thanathai Lertpetchpun; Shrikanth Narayanan

TL;DR

Adversarially induced changes in discrete codebook tokens strongly correlate with transcription error. The robustness gains of neural codecs persist under adaptive attacks, where neural codec configurations outperform traditional compression defenses.

Abstract

Adversarial perturbations exploit vulnerabilities in automatic speech recognition (ASR) systems while preserving the linguistic content that humans perceive. Neural audio codecs impose a discrete bottleneck that can suppress the fine-grained signal variations associated with adversarial noise. We examine how the granularity of this bottleneck, controlled by residual vector quantization (RVQ) depth, shapes adversarial robustness. We observe a non-monotonic trade-off under gradient-based attacks: shallow quantization suppresses adversarial perturbations but degrades speech content, while deeper quantization preserves both content and perturbations. Intermediate depths balance these effects and minimize transcription error. We further show that adversarially induced changes in discrete codebook tokens strongly correlate with transcription error. The robustness gains of intermediate depths persist under adaptive attacks, where neural codec configurations outperform traditional compression defenses.
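To make the role of RVQ depth concrete, the following is a minimal sketch of residual vector quantization on toy data. It is not the codecs' actual implementation: the random codebooks, the dimensions, the halving scale per stage, and the names `rvq_encode` and `mean_sq_error` are all assumptions made for illustration. Each stage quantizes the residual left by the previous one, so a larger depth N reproduces finer detail of the input, which is the capacity that the abstract describes as preserving both content and perturbations.

```python
import random

def nearest(codebook, v):
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], v)))

def rvq_encode(x, codebooks, depth):
    """Quantize vector x with the first `depth` codebooks.

    Returns (tokens, reconstruction): one token per stage, and the sum of
    the selected codebook entries.
    """
    residual = list(x)
    recon = [0.0] * len(x)
    tokens = []
    for cb in codebooks[:depth]:
        k = nearest(cb, residual)          # quantize what is still unexplained
        tokens.append(k)
        recon = [r + c for r, c in zip(recon, cb[k])]
        residual = [r - c for r, c in zip(residual, cb[k])]
    return tokens, recon

# Toy setup (hypothetical): random codebooks whose scale halves at each stage.
random.seed(0)
DIM, K, MAX_DEPTH = 8, 16, 6
codebooks = []
for level in range(MAX_DEPTH):
    scale = 0.5 ** level
    cb = [[random.gauss(0.0, scale) for _ in range(DIM)] for _ in range(K)]
    cb[0] = [0.0] * DIM  # zero entry: an extra stage can never increase the error
    codebooks.append(cb)

frames = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(50)]

def mean_sq_error(depth):
    """Mean squared reconstruction error over the toy frames at a given depth."""
    total = 0.0
    for x in frames:
        _, recon = rvq_encode(x, codebooks, depth)
        total += sum((a - b) ** 2 for a, b in zip(x, recon))
    return total / len(frames)

errors = [mean_sq_error(n) for n in range(1, MAX_DEPTH + 1)]
for n, e in enumerate(errors, start=1):
    print(f"depth N={n}: mean squared error = {e:.4f}")
```

In this toy setting the reconstruction error cannot grow with depth, because every codebook contains a zero entry. Real codecs learn their codebooks, so the sketch only illustrates the mechanism, not the magnitudes reported in the paper.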

Paper Structure (17 sections, 2 equations, 3 figures, 2 tables)

This paper contains 17 sections, 2 equations, 3 figures, 2 tables.

Introduction
Background
Neural Audio Codecs
Adversarial Attacks
Existing Defense Mechanisms
Methodology
Threat Model and Attacks
RVQ Bottleneck
Experimental Setup
Results and Discussion
Adversarial Effects Across RVQ Depth
RVQ Token Changes Predict ASR Degradation
Neural Codecs vs. Traditional Defenses under PGD
Adaptive Attack Analysis
Conclusion
...and 2 more sections

Figures (3)

Figure 1: Codec-based inference-time transformation for ASR. Neural audio codecs impose a discrete RVQ bottleneck on adversarial inputs. PGD ignores the codec during optimization, while BPDA+EOT adapts by approximating codec gradients.
Figure 2: CCR (left) and WER (right) versus RVQ depth $N$ under PGD for DAC (top), EnCodec (middle), and Mimi (bottom), evaluated using Whisper. For $\epsilon>0$, CCR increases monotonically with depth, while WER depends non-monotonically on depth, with a minimum at intermediate $N$. The clean baseline ($\epsilon=0$) is shown for WER only; CCR is zero there and is omitted. Only a subset of depths is shown for clarity.
Figure 3: $\Delta$WER versus CCR under PGD for Wav2Vec 2.0 and Whisper across DAC, EnCodec, and Mimi. Each point represents a depth--budget configuration, with point size indicating RVQ depth $N$ and color denoting perturbation strength $\epsilon$. $\Delta$WER is measured relative to the clean ($\epsilon=0$) baseline. Spearman rank correlations are averaged over all $\epsilon$, with a subset of depths shown for clarity.
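The relationship plotted in Figure 3 can be sketched in a few lines. The captions do not spell out CCR, so this sketch assumes it measures the fraction of codebook tokens that differ between the clean and adversarial encodings of the same utterance, which is consistent with CCR being zero at $\epsilon=0$. The function names and all numbers below are illustrative placeholders, not results from the paper.

```python
def token_change_fraction(clean_tokens, adv_tokens):
    """Fraction of token positions that differ between two equal-length sequences."""
    if len(clean_tokens) != len(adv_tokens):
        raise ValueError("token sequences must have the same length")
    if not clean_tokens:
        return 0.0
    changed = sum(c != a for c, a in zip(clean_tokens, adv_tokens))
    return changed / len(clean_tokens)

def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical flattened token sequences for one utterance.
clean = [3, 7, 7, 1]
adversarial = [3, 2, 7, 5]
change = token_change_fraction(clean, adversarial)

# Made-up (change fraction, delta-WER) pairs, one per depth-budget configuration.
change_rates = [0.1, 0.3, 0.5, 0.8]
delta_wer = [0.02, 0.05, 0.20, 0.45]
rho = spearman(change_rates, delta_wer)
print(f"token change fraction = {change}, Spearman rho = {rho:.2f}")
```

Spearman correlation is a natural fit here because it only assumes a monotonic relationship between token changes and the WER increase, not a linear one.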
