Whispering Under the Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems

Weifei Jin, Yuxin Cao, Junjie Su, Derui Wang, Yedi Zhang, Minhui Xue, Jie Hao, Jin Song Dong, Yixian Yang

TL;DR

This work introduces AudioShield, a real-time privacy-preserving framework against commercial and LLM-powered ASR systems. It leverages Transferable Universal Adversarial Perturbations in Latent Space (LS-TUAP) and a target feature adaptation mechanism to achieve high universality, transferability to unseen models, and preserved audio quality by perturbing latent representations rather than the audio itself. The approach incorporates protection preparation, perturbation generation, and robustness to over-the-air conditions via a VAE-based encoder/decoder and RIR-based physical modeling, validated across cloud APIs, LLM-powered ASR, NN-based models, and voice assistants. Extensive experiments demonstrate superior protection performance and audio quality, along with resilience to adaptive countermeasures, making a strong case for real-world privacy protection in mass speech surveillance. However, limitations include inconsistent outputs across target models and semantic coherence challenges, which point to directions for future work in multi-model consistency and output coherence.

Abstract

The widespread application of automatic speech recognition (ASR) supports large-scale voice surveillance, raising privacy concerns among users. In this paper, we concentrate on using adversarial examples to prevent potential eavesdroppers from obtaining private speech content without authorization in speech communications. While audio adversarial examples have demonstrated the capability to mislead ASR models or evade ASR surveillance, they are typically constructed through time-intensive offline optimization, restricting their practicality in real-time voice communication. Recent work overcame this limitation by generating universal adversarial perturbations (UAPs) and enhancing their transferability for black-box scenarios. However, these methods introduce excessive noise that significantly degrades audio quality and affects human perception, thereby limiting their effectiveness in practical scenarios. To address this limitation and protect live users' speech against ASR systems, we propose a novel framework, AudioShield. Central to this framework is the concept of Transferable Universal Adversarial Perturbations in the Latent Space (LS-TUAP). By moving the perturbations to the latent space, the audio quality is preserved to a large extent. Additionally, we propose target feature adaptation to enhance the transferability of UAPs by embedding target text features into the perturbations. Comprehensive evaluation on four commercial ASR APIs (Google, Amazon, iFlytek, and Alibaba), three voice assistants, two LLM-powered ASR systems, and one NN-based ASR model demonstrates the protection superiority of AudioShield over existing competitors, and both objective and subjective evaluations indicate that AudioShield significantly improves the audio quality. Moreover, AudioShield shows high effectiveness in real-time end-to-end scenarios and demonstrates strong resilience against adaptive countermeasures.
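The latent-space idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: random linear maps stand in for the VAE encoder and decoder, the perturbation `delta` is random rather than optimized against any ASR model, and all names and sizes (`FRAME`, `LATENT`, `protect`) are hypothetical. The sketch only shows the mechanics of adding one universal perturbation to every latent code and checks that, for a decoder with Lipschitz constant `a`, the change in the decoded signal is at most `a * ||delta||`, which follows directly from the definition of Lipschitz continuity.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME = 64    # samples per audio frame (hypothetical size)
LATENT = 16   # latent dimension (hypothetical size)

# Toy stand-ins for a VAE encoder/decoder: random linear maps (assumption).
W_enc = rng.standard_normal((LATENT, FRAME)) / np.sqrt(FRAME)
W_dec = rng.standard_normal((FRAME, LATENT)) / np.sqrt(LATENT)

def encode(frames):
    """Map audio frames of shape (n, FRAME) to latent codes of shape (n, LATENT)."""
    return frames @ W_enc.T

def decode(latents):
    """Map latent codes of shape (n, LATENT) back to frames of shape (n, FRAME)."""
    return latents @ W_dec.T

def protect(frames, delta):
    """Add the same (universal) perturbation to every latent code, then decode."""
    return decode(encode(frames) + delta)

# For a linear decoder, the Lipschitz constant is its largest singular value.
a = np.linalg.norm(W_dec, 2)

frames = rng.standard_normal((8, FRAME))      # eight random "audio" frames
delta = 0.1 * rng.standard_normal(LATENT)     # random, not optimized

clean = decode(encode(frames))
protected = protect(frames, delta)

# Per-frame change in the decoded signal versus the Lipschitz bound.
change = np.linalg.norm(protected - clean, axis=1)
bound = a * np.linalg.norm(delta)
print("max change:", change.max(), "bound:", bound)
```

Because the same `delta` is reused for every frame, nothing has to be optimized per utterance at run time, which is what makes a universal perturbation suitable for live speech. In a real system, `delta` would be trained offline against surrogate ASR models, and the encoder and decoder would be learned networks.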

Paper Structure

This paper contains 32 sections, 1 theorem, 11 equations, 11 figures, 15 tables, 1 algorithm.

Key Result

Theorem 1

Assume that the deterministic component of the decoder $\mathcal{D}$ is $a$-Lipschitz. Given two independent latent codes $z_1$ and $z_2$, then for all $r \in \mathcal{R}^{+}$,

Figures (11)

  • Figure 1: Overview of large-scale speech communication surveillance scenarios without/with AudioShield.
  • Figure 2: The architecture of a typical ASR system.
  • Figure 3: Protection scenario of AudioShield.
  • Figure 4: Workflow of AudioShield. The three main steps of the whole process are: protection preparation, perturbation generation and target feature adaptation.
  • Figure 5: Illustration of perturbation generation and target feature adaptation.
  • ...and 6 more figures

Theorems & Definitions (1)

  • Theorem 1