ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features

Peng Cheng, Yuwei Wang, Peng Huang, Zhongjie Ba, Xiaodong Lin, Feng Lin, Li Lu, Kui Ren

TL;DR

This paper introduces ALIF, a low-cost black-box adversarial audio framework that attacks automatic speech recognition (ASR) systems by perturbing linguistic embeddings rather than raw waveforms. By exploiting the reciprocal relationship between ASR and text-to-speech (TTS), ALIF targets the decision boundary in a low-dimensional embedding space, which yields much higher query efficiency and better robustness to model updates. It presents two attack schemes: ALIF-OTL for digital-domain subtitling and ALIF-OTA for over-the-air attacks on APIs and voice assistants. Extensive experiments show query-efficiency improvements of up to ~97.7% over prior work and strong success rates across multiple commercial systems. The work also analyzes defenses, robustness factors, and limitations, highlights the practical security implications for speech-enabled platforms, and outlines avenues for future defense and hardening strategies.

Abstract

Extensive research has revealed that adversarial examples (AE) pose a significant threat to voice-controllable smart devices. Recent studies have proposed black-box adversarial attacks that require only the final transcription from an automatic speech recognition (ASR) system. However, these attacks typically involve many queries to the ASR, resulting in substantial costs. Moreover, AE-based adversarial audio samples are susceptible to ASR updates. In this paper, we identify the root cause of these limitations, namely the inability to construct AE attack samples directly around the decision boundary of deep learning (DL) models. Building on this observation, we propose ALIF, the first black-box adversarial linguistic feature-based attack pipeline. We leverage the reciprocal process of text-to-speech (TTS) and ASR models to generate perturbations in the linguistic embedding space where the decision boundary resides. Based on the ALIF pipeline, we present the ALIF-OTL and ALIF-OTA schemes for launching attacks in both the digital domain and the physical playback environment on four commercial ASRs and voice assistants. Extensive evaluations demonstrate that ALIF-OTL and -OTA significantly improve query efficiency by 97.7% and 73.3%, respectively, while achieving competitive performance compared to existing methods. Notably, ALIF-OTL can generate an attack sample with only one query. Furthermore, our test-of-time experiment validates the robustness of our approach against ASR updates.
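
As a rough illustration of the efficiency figures above, one plausible reading of "improve query efficiency by 97.7%" is a relative reduction in the number of ASR queries needed per attack sample. The sketch below computes that reduction under this assumed interpretation. The function name and the query counts are hypothetical and chosen only to make the arithmetic concrete; they are not values reported in the paper.

```python
def query_reduction_percent(baseline_queries: float, new_queries: float) -> float:
    """Relative reduction in query count, in percent.

    Assumption: "query efficiency improvement" means
    (baseline - new) / baseline. This is one possible reading,
    not a definition taken from the paper.
    """
    if baseline_queries <= 0:
        raise ValueError("baseline_queries must be positive")
    if new_queries < 0:
        raise ValueError("new_queries must be non-negative")
    return 100.0 * (baseline_queries - new_queries) / baseline_queries


# Hypothetical counts for illustration only (not from the paper):
# a baseline attack needing 1000 queries versus one needing 23.
digital_reduction = query_reduction_percent(1000, 23)

# Hypothetical counts again: 300 baseline queries versus 80.
physical_reduction = query_reduction_percent(300, 80)

print(f"digital-domain example: {digital_reduction:.1f}% fewer queries")
print(f"over-the-air example:   {physical_reduction:.1f}% fewer queries")
```

Under this reading, a reduction close to 100% corresponds to an attack that needs only a handful of queries where the baseline needs hundreds or thousands, which is consistent with the abstract's remark that ALIF-OTL can generate an attack sample with a single query.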

Paper Structure (39 sections, 10 equations, 9 figures, 13 tables, 2 algorithms)

Figures (9)

  • Figure 1: Limitations of existing adversarial attacks based on adversarial examples (AEs). The traditional pipeline for generating black-box audio adversarial examples necessitates the attacker making many API queries, which is both costly and time-consuming. The attacker can then obtain an example capable of successfully attacking the API. However, the example's effectiveness is significantly diminished by model updates.
  • Figure 2: Architectures of an ASR and a TTS system.
  • Figure 3: System model of ALIF-OTL. This depicts an attacker incorporating an adversarial audio track into a video. As a result, when the manipulated video is uploaded to the platform, the automatic subtitling service inadvertently generates inappropriate text. The ALIF-OTL attack occurs within the digital domain.
  • Figure 4: System model of ALIF-OTA. This model illustrates two potential attack scenarios. In the first scenario (upper part), smartphone apps with voice interaction capabilities record the attack audio signals in the environment and then call online APIs to transcribe the audio. In the second scenario (lower part), an attacker plays attack audio samples to activate voice assistants (such as smart speakers) that possess their own ASR backend, thereby executing the commands.
  • Figure 5: Contrast of AE-based attacks and our work.
  • ...and 4 more figures