Exploiting Vulnerabilities in Speech Translation Systems through Targeted Adversarial Attacks
Chang Liu, Haolin Wu, Xi Yang, Kui Zhang, Cong Wu, Weiming Zhang, Nenghai Yu, Tianwei Zhang, Qing Guo, Jie Zhang
TL;DR
The paper investigates the vulnerability of state-of-the-art speech translation (ST) systems to targeted adversarial attacks and introduces two strategies: perturbation-based manipulation of source audio and diffusion-based adversarial music generation. The perturbation attack is augmented with Multi-language Enhancement and Target Cycle Optimization to improve cross-language transfer, while the diffusion-guided music attack steers translations toward predefined semantics for both Seen and Unseen languages. Extensive evaluations on Seamless and Canary show that the attacks are effective, transfer across models, and remain feasible, though imperfect, over the air with physical devices. Defense experiments with audio processing yield only partial resilience and no definitive remedy, underscoring the need for robust ST architectures and defense mechanisms to safeguard multilingual audio pipelines in real-world settings.
Abstract
As speech translation (ST) systems become increasingly prevalent, understanding their vulnerabilities is crucial for ensuring robust and reliable communication. However, limited work has explored this issue in depth. This paper explores methods of compromising these systems through imperceptible audio manipulations. Specifically, we present two approaches: (1) the injection of perturbations into source audio, and (2) the generation of adversarial music designed to guide targeted translation. We also conduct more practical over-the-air attacks in the physical world. Our experiments reveal that carefully crafted audio perturbations can mislead translation models into producing targeted, harmful outputs, while adversarial music achieves this goal more covertly, exploiting the natural imperceptibility of music. These attacks prove effective across multiple languages and translation models, highlighting a systemic vulnerability in current ST architectures. The implications of this research extend beyond immediate security concerns, shedding light on the interpretability and robustness of neural speech processing systems. Our findings underscore the need for advanced defense mechanisms and more resilient architectures for audio systems. More details and samples can be found at https://adv-st.github.io.
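
To make the idea of a targeted perturbation concrete, the following is a minimal sketch of generic projected gradient descent under an L-infinity budget. It is not the method of this paper and does not touch Seamless or Canary: the "model" is a hypothetical random linear classifier over a synthetic waveform, and all names (`targeted_pgd`, `W`, `eps`, `step`) are assumptions chosen for illustration only.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def targeted_pgd(x, W, target, eps=0.02, step=0.002, iters=200):
    """Search for delta with |delta| <= eps that pushes W @ (x + delta) toward `target`.

    Toy stand-in: a real ST attack would backpropagate through the
    translation model's loss on the target sentence instead.
    """
    delta = np.zeros_like(x)
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    for _ in range(iters):
        p = softmax(W @ (x + delta))
        # Gradient of cross-entropy(target) w.r.t. the input of a linear model.
        grad = W.T @ (p - onehot)
        # Signed descent step, then projection onto the L-infinity ball.
        delta = np.clip(delta - step * np.sign(grad), -eps, eps)
        # Keep the perturbed signal inside the valid sample range [-1, 1].
        delta = np.clip(x + delta, -1.0, 1.0) - x
    return delta


rng = np.random.default_rng(0)
n_samples, n_classes, eps = 4000, 5, 0.02
x = rng.uniform(-0.2, 0.2, size=n_samples)   # synthetic "audio"
W = rng.normal(size=(n_classes, n_samples))  # hypothetical linear model

clean_pred = int(np.argmax(W @ x))
target = (clean_pred + 1) % n_classes        # any class other than the clean one
delta = targeted_pgd(x, W, target, eps=eps)
adv_pred = int(np.argmax(W @ (x + delta)))
print(clean_pred, target, adv_pred)
```

In this toy setting the perturbation budget is small relative to the signal amplitude yet large enough to change the predicted class; whether a comparable budget is imperceptible or effective on a real ST model is an empirical question the sketch does not address.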
