Collaborative Watermarking for Adversarial Speech Synthesis

Lauri Juvela, Xin Wang

TL;DR

This work responds to the rise of neural speech synthesis, and the resulting need for detectable watermarking, by proposing a collaborative training framework in which a neural vocoder embeds a watermark that a dedicated detector can reliably extract without harming perceptual quality. By letting the detector influence the generator (Collaborator mode) and applying differentiable augmentation, the approach consistently improves detection performance over conventional passive training across clean, noisy, and time-stretched conditions. The study demonstrates the method with HiFi-GAN and the ASVspoof 2021 baseline detectors, shows robustness gains, and confirms through listening tests that perceptual quality remains largely unaffected. Although the study focuses on neural vocoding, the results suggest the approach could extend to full TTS and other generative models, with possible benefits from richer watermark payloads and specialized architectures.

Abstract

Advances in neural speech synthesis have brought us technology that is not only close to human naturalness, but is also capable of instant voice cloning with little data, and is highly accessible with pre-trained models available. Naturally, the potential flood of generated content raises the need for synthetic speech detection and watermarking. Recently, considerable research effort in synthetic speech detection has been related to the Automatic Speaker Verification and Spoofing Countermeasure Challenge (ASVspoof), which focuses on passive countermeasures. This paper takes a complementary view to generated speech detection: a synthesis system should make an active effort to watermark the generated speech in a way that aids detection by another machine, but remains transparent to a human listener. We propose a collaborative training scheme for synthetic speech watermarking and show that a HiFi-GAN neural vocoder collaborating with the ASVspoof 2021 baseline countermeasure models consistently improves detection performance over conventional classifier training. Furthermore, we demonstrate how collaborative training can be paired with augmentation strategies for added robustness against noise and time-stretching. Finally, listening tests demonstrate that collaborative training has little adverse effect on perceptual quality of vocoded speech.

Paper Structure (13 sections, 7 equations, 1 figure, 2 tables)

This paper contains 13 sections, 7 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: A Generator can view detector models in three distinct roles: fool the Discriminator to produce more realistic samples, ignore the Observer, or help the Collaborator to extract a watermark from generated speech.
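
The three roles in Figure 1 can be read as three different loss terms that a detector contributes to the generator's objective. The sketch below is only an illustration under stated assumptions: it uses a binary cross-entropy formulation, and the function name `generator_detector_loss` and the argument `p_fake` are hypothetical. The exact losses used in the paper are not given in this section and may differ.

```python
import math


def generator_detector_loss(role: str, p_fake: float) -> float:
    """Loss term the generator receives from one detector (illustrative only).

    p_fake is the detector's estimated probability that a generated
    sample is synthetic. Binary cross-entropy is an assumption here,
    not necessarily the loss used in the paper.
    """
    eps = 1e-12
    p = min(max(p_fake, eps), 1.0 - eps)  # clamp to keep log() finite

    if role == "discriminator":
        # Adversarial: the generator is rewarded when the detector
        # believes the sample is real (low p_fake).
        return -math.log(1.0 - p)
    if role == "observer":
        # Passive: the detector trains on generated samples, but the
        # generator receives no signal from it.
        return 0.0
    if role == "collaborator":
        # Cooperative: the generator is rewarded when the detector
        # correctly flags the sample as synthetic (high p_fake).
        return -math.log(p)
    raise ValueError(f"unknown role: {role!r}")


if __name__ == "__main__":
    for role in ("discriminator", "observer", "collaborator"):
        print(role, generator_detector_loss(role, 0.9))
```

Under this formulation the Discriminator and Collaborator terms pull in opposite directions for the same detector output, which is why the figure treats them as distinct roles rather than as one shared objective.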