Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio

Guangke Chen, Yuhui Wang, Shouling Ji, Xiapu Luo, Ting Wang

TL;DR

This work examines a content-centric misuse vector for TTS systems built on large audio-language models (LALMs), demonstrating that harmful speech can be produced even when safeguards are in place. It introduces HarmGen, a suite of five attacks that exploit both text and audio modalities to bypass safety alignment and moderation, and evaluates them across five commercial LALM-based TTS systems and three datasets spanning two languages. The study finds that the advanced attacks significantly reduce model refusals and increase toxicity, with Concat and Shuffle often outperforming audio-based methods, and it highlights gaps in both reactive deepfake detection and moderation pipelines. It concludes that proactive, cross-modal safeguards at training and deployment are necessary to curb the risks of high-fidelity, audio-based content manipulation.

Abstract

Modern text-to-speech (TTS) systems, particularly those built on Large Audio-Language Models (LALMs), generate high-fidelity speech that faithfully reproduces input text and mimics specified speaker identities. While prior misuse studies have focused on speaker impersonation, this work explores a distinct content-centric threat: exploiting TTS systems to produce speech containing harmful content. Realizing such threats poses two core challenges: (1) LALM safety alignment frequently rejects harmful prompts, yet existing jailbreak attacks are ill-suited for TTS because these systems are designed to faithfully vocalize any input text, and (2) real-world deployment pipelines often employ input/output filters that block harmful text and audio. We present HARMGEN, a suite of five attacks organized into two families that address these challenges. The first family employs semantic obfuscation techniques (Concat, Shuffle) that conceal harmful content within text. The second leverages audio-modality exploits (Read, Spell, Phoneme) that inject harmful content through auxiliary audio channels while maintaining benign textual prompts. Through evaluation across five commercial LALM-based TTS systems and three datasets spanning two languages, we demonstrate that our attacks substantially reduce refusal rates and increase the toxicity of generated speech. We further assess both reactive countermeasures deployed by audio-streaming platforms and proactive defenses implemented by TTS providers. Our analysis reveals critical vulnerabilities: deepfake detectors underperform on high-fidelity audio, reactive moderation can be circumvented by adversarial perturbations, and proactive moderation detects 57-93% of attacks. Our work highlights a previously underexplored content-centric misuse vector for TTS and underscores the need for robust cross-modal safeguards throughout training and deployment.

Paper Structure

This paper contains 35 sections, 11 figures, 3 tables, 3 algorithms.

Figures (11)

  • Figure 1: Overview of HarmGen.
  • Figure 2: The overview of the two baseline attacks and the five advanced attacks. A random harmful text is selected and the target group is replaced with "*****".
  • Figure 3: Effectiveness of advanced attacks. Note: (1) For each model and dataset, we exclude those inputs on which the B$_1$ attack succeeds at least once within the 10 trials; hence the R$_2$ of B$_1$ equals 100%. (2) Some models do not support multiple audio inputs, so we exclude those inputs with more than one toxic word for the Read, Spell, and Phoneme attacks. (3) An empty bar denotes that the corresponding data is unavailable (N/A) for one of the following reasons: B$_1$ achieves a 0% refusal rate, so no data point is evaluated; we do not evaluate the Spell and Phoneme attacks on the Chinese dataset Mul-ZH, since a Chinese word cannot be spelled and the Phoneme attack requires Pinyin, which we leave as future work; some models do not support audio inputs, so the B$_2$, Read, Spell, and Phoneme attacks are not evaluated; the Gemini-2.5-live model cannot interpret the Read, Spell, and Phoneme attacks, so these attacks are not evaluated. (4) To distinguish a result of 0 from an N/A result, we intentionally raise the height of the bar corresponding to 0 and annotate it with a "0" marker.
  • Figure 4: Impact of output voice style
  • Figure 5: Effectiveness across different harmful categories
  • ...and 6 more figures