WavMark: Watermarking for Audio Generation

Guangyu Chen, Yu Wu, Shujie Liu, Tao Liu, Xiaoyong Du, Furu Wei

TL;DR

WavMark is an audio watermarking framework that uses invertible neural networks to embed a 32-bit signature into 1 second of audio, enabling end-to-end encoding and decoding, with the watermark located via Brute Force Detection. The approach achieves high imperceptibility (SNR ≈ 36–38 dB, PESQ ≈ 4.2) and strong robustness across ten attacks, outperforming prior DNN-based and traditional tools, including in utterance-level and synthetic-audio scenarios. The model is trained on 5k hours of multi-domain data and adds a shift module plus curriculum learning; together these give reliable watermark location and decoding, with an average BER of 0.48% across ten common attacks when 10 to 20-second audio is used as the host. This supports copyright protection and authentication for audio generation. The stated limitations concern support for higher sample rates and the handling of muted or silent segments, which point to real-time and adaptive encoding as avenues for improvement.
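For readers unfamiliar with the imperceptibility figure quoted above, the following is a minimal sketch of how a signal-to-noise ratio in dB is conventionally computed between a host signal and its watermarked version. It assumes the standard definition (host energy over the energy of the difference); the function name `snr_db` is illustrative and is not taken from the paper or its code.

```python
import math
from typing import Sequence


def snr_db(host: Sequence[float], watermarked: Sequence[float]) -> float:
    """Conventional SNR in dB: 10 * log10(host energy / distortion energy).

    Assumption: this is the textbook definition, not necessarily the exact
    evaluation script used by the authors.
    """
    if len(host) != len(watermarked) or len(host) == 0:
        raise ValueError("host and watermarked must be non-empty and equal in length")
    signal_energy = sum(h * h for h in host)
    noise_energy = sum((h - w) ** 2 for h, w in zip(host, watermarked))
    if noise_energy == 0:
        return math.inf   # identical signals: no distortion at all
    if signal_energy == 0:
        return -math.inf  # silent host: any distortion dominates
    return 10.0 * math.log10(signal_energy / noise_energy)
```

Under this definition, an SNR of 36 to 38 dB means the embedded distortion carries roughly 1/4000 to 1/6300 of the host signal's energy. The silent-host branch also hints at why muted segments are awkward for any energy-relative quality measure.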

Abstract

Recent breakthroughs in zero-shot voice synthesis have enabled imitating a speaker's voice using just a few seconds of recording while maintaining a high level of realism. Alongside its potential benefits, this powerful technology introduces notable risks, including voice fraud and speaker impersonation. Unlike the conventional approach of solely relying on passive methods for detecting synthetic data, watermarking presents a proactive and robust defence mechanism against these looming risks. This paper introduces an innovative audio watermarking framework that encodes up to 32 bits of watermark within a mere 1-second audio snippet. The watermark is imperceptible to human senses and exhibits strong resilience against various attacks. It can serve as an effective identifier for synthesized voices and holds potential for broader applications in audio copyright protection. Moreover, this framework boasts high flexibility, allowing for the combination of multiple watermark segments to achieve heightened robustness and expanded capacity. Utilizing 10 to 20-second audio as the host, our approach demonstrates an average Bit Error Rate (BER) of 0.48% across ten common attacks, a remarkable reduction of over 2800% in BER compared to the state-of-the-art watermarking tool. See https://aka.ms/wavmark for demos of our work.
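The Bit Error Rate quoted in the abstract is the fraction of decoded bits that differ from the embedded ones. The sketch below shows that computation, together with one hypothetical way of combining several decoded copies of the same message: a per-bit majority vote. The abstract states only that multiple watermark segments can be combined for robustness; the voting rule, the tie-breaking choice, and the names `bit_error_rate` and `majority_vote` are assumptions made for illustration.

```python
from typing import List, Sequence


def bit_error_rate(sent: Sequence[int], received: Sequence[int]) -> float:
    """Fraction of positions where the decoded bit differs from the sent bit."""
    if len(sent) != len(received) or len(sent) == 0:
        raise ValueError("bit sequences must be non-empty and equal in length")
    errors = sum(1 for s, r in zip(sent, received) if s != r)
    return errors / len(sent)


def majority_vote(decoded_segments: Sequence[Sequence[int]]) -> List[int]:
    """Combine several decodings of the same message bit by bit.

    Assumption: simple majority voting, used here only as an illustration of
    segment combination. Ties (possible with an even number of segments)
    resolve to 0.
    """
    if not decoded_segments:
        raise ValueError("need at least one decoded segment")
    n = len(decoded_segments)
    length = len(decoded_segments[0])
    if any(len(seg) != length for seg in decoded_segments):
        raise ValueError("all decoded segments must have the same length")
    return [
        1 if 2 * sum(seg[i] for seg in decoded_segments) > n else 0
        for i in range(length)
    ]
```

For a 32-bit payload, a single wrong bit already gives a BER of 1/32 ≈ 3.1%, so an average of 0.48% corresponds to well under one bit error per decoded message on average.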

Paper Structure

This paper contains 29 sections, 13 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Left: The watermark encoding process of our framework. We iteratively add the same watermark into 1-second segments of the host audio to ensure full-time region protection. Even if the watermarked audio is clipped, decoding is possible using any complete watermark segment. Right: Robustness comparison with the state-of-the-art (SOTA) watermarking tool. Our framework demonstrates comparable watermarking capacity and imperceptibility to the leading watermarking tool while showcasing superior robustness across ten attack scenarios.
  • Figure 2: The overview of our training framework. The encoder combines the host audio and message vector to generate the watermarked audio. A shift module then randomly shifts the decoding window by a small distance. Random attacks are subsequently applied to the shifted audio to corrupt the watermark. Finally, the decoder recovers the message from the attacked audio.
  • Figure 3: Diagram of the utterance-based evaluation. The watermark tool adds multiple segments to the host. Then we destroy the first watermark segment by clipping and subsequently utilize the remaining audio for decoding. The yellow area represents the incomplete watermarked segment.
  • Figure 4: Diagram of the watermark locating test.
  • Figure 5: Host audio samples. The orange region represents the difference between the host audio and the watermarked audio, which is magnified by a factor of 25 for clarity ($25 \times (\mathbf{x}_{wave}-\mathbf{x}'_{wave})$).
  • ...and 3 more figures
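The Figure 1 caption describes adding the same watermark to every 1-second segment of the host so that any complete segment suffices for decoding after clipping. The following sketch illustrates only that tiling step. The per-segment encoder is passed in as a callable, and `toy_encoder` is a hypothetical placeholder (it is not the invertible neural network described in the paper). Leaving a trailing partial segment unmarked is also an assumption, as is the `sample_rate` parameter.

```python
from typing import Callable, List, Sequence

Encoder = Callable[[List[float], Sequence[int]], List[float]]


def tile_watermark(
    host: Sequence[float],
    bits: Sequence[int],
    sample_rate: int,
    encode_segment: Encoder,
) -> List[float]:
    """Apply the same message to each complete 1-second segment of the host."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    seg_len = sample_rate  # number of samples in one second
    n_full = len(host) // seg_len
    out: List[float] = []
    for i in range(n_full):
        segment = list(host[i * seg_len:(i + 1) * seg_len])
        marked = encode_segment(segment, bits)
        if len(marked) != seg_len:
            raise ValueError("encoder must preserve segment length")
        out.extend(marked)
    # Assumption: a trailing partial segment is copied through unmarked.
    out.extend(host[n_full * seg_len:])
    return out


def toy_encoder(segment: List[float], bits: Sequence[int]) -> List[float]:
    """Hypothetical stand-in encoder: nudges the first samples by +/- 1e-3."""
    marked = list(segment)
    for i, bit in enumerate(bits[:len(marked)]):
        marked[i] += 1e-3 if bit else -1e-3
    return marked
```

Because every complete segment carries the full message, clipping the audio destroys at most the segments that the cut passes through. A decoder can then work from any intact segment, which is the property the utterance-based evaluation in Figure 3 exercises.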