Speech Watermarking with Discrete Intermediate Representations

Shengpeng Ji, Ziyue Jiang, Jialong Zuo, Minghui Fang, Yifu Chen, Tao Jin, Zhou Zhao

TL;DR

DiscreteWM addresses the security gap opened by voice cloning by watermarking speech in a robust discrete latent space obtained with a VQVAE. Watermarks are embedded through modular relations on discrete token IDs, using a frame-wise strategy and a localizer/restorer pair for reliable extraction; a Z-test enables utterance-level detection of AI-generated speech. The system achieves state-of-the-art robustness and imperceptibility, supports 1–150 bits per second, and detects watermarks considerably faster than sliding-window methods. The approach offers a practical, flexible solution for both information hiding and proactive detection of AI-generated content in real-world speech applications.

Abstract

Speech watermarking techniques can proactively mitigate the potential harmful consequences of instant voice cloning techniques. These techniques involve the insertion of signals into speech that are imperceptible to humans but can be detected by algorithms. Previous approaches typically embed watermark messages into continuous space. However, intuitively, embedding watermark information into robust discrete latent space can significantly improve the robustness of watermarking systems. In this paper, we propose DiscreteWM, a novel speech watermarking framework that injects watermarks into the discrete intermediate representations of speech. Specifically, we map speech into discrete latent space with a vector-quantized autoencoder and inject watermarks by changing the modular arithmetic relation of discrete IDs. To ensure the imperceptibility of watermarks, we also propose a manipulator model to select the candidate tokens for watermark embedding. Experimental results demonstrate that our framework achieves state-of-the-art performance in robustness and imperceptibility, simultaneously. Moreover, our flexible frame-wise approach can serve as an efficient solution for both voice cloning detection and information hiding. Additionally, DiscreteWM can encode 1 to 150 bits of watermark information within a 1-second speech clip, indicating its encoding capacity. Audio samples are available at https://DiscreteWM.github.io/discrete_wm.
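
The modular-arithmetic embedding described in the abstract can be illustrated with a minimal sketch. It assumes a modulus of 2, so that each token carries one bit in the parity of its ID, and a per-frame probability vector over the codebook. The function names `embed_bit`, `extract_bit`, and `embed_message` are hypothetical, and the plain highest-probability selection only stands in for the learned manipulator, which is not reproduced here.

```python
from typing import List, Sequence


def embed_bit(probs: Sequence[float], bit: int, modulus: int = 2) -> int:
    """Pick the most probable token ID whose residue (ID mod modulus) equals `bit`.

    `probs` is an assumed per-frame probability vector over the codebook;
    in the paper a learned manipulator provides the candidate distribution.
    """
    candidates = [i for i in range(len(probs)) if i % modulus == bit]
    if not candidates:
        raise ValueError("no token ID has the requested residue")
    return max(candidates, key=lambda i: probs[i])


def extract_bit(token_id: int, modulus: int = 2) -> int:
    """Read the hidden value back from the residue of the token ID."""
    return token_id % modulus


def embed_message(frame_probs: Sequence[Sequence[float]],
                  bits: Sequence[int],
                  modulus: int = 2) -> List[int]:
    """Frame-wise embedding: one message value per frame (an assumed layout)."""
    if len(frame_probs) != len(bits):
        raise ValueError("need exactly one bit per frame")
    return [embed_bit(p, b, modulus) for p, b in zip(frame_probs, bits)]


# Toy codebook of six tokens (IDs 0..5) with made-up probabilities.
probs = [0.05, 0.10, 0.40, 0.05, 0.10, 0.30]
odd_choice = embed_bit(probs, 1)    # restricted to odd IDs 1, 3, 5
even_choice = embed_bit(probs, 0)   # restricted to even IDs 0, 2, 4
tokens = embed_message([probs, probs, probs], [1, 0, 1])
recovered = [extract_bit(t) for t in tokens]
```

Because the message lives in the residue of the ID rather than in the waveform, extraction in this sketch reduces to a modulo operation once the tokens have been recovered; in the full system the localizer and restorer are responsible for recovering them from possibly distorted audio.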

Paper Structure

This paper contains 40 sections, 8 equations, 5 figures, 9 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration for speech watermarking strategies. Upper: The embedder learns to encode the watermark string into the continuous space with imperceptibility loss and watermark loss. Lower: In our discrete scheme, the vector-quantized variational autoencoder (VQVAE) maps speech into discrete latent space, and the manipulator conceals the watermark string within the modulus relations of discrete token IDs.
  • Figure 2: The overall architecture of DiscreteWM. "VQ" represents the "vector quantization" operation, and Ⓒ denotes the concatenation operation. During the watermark embedding process, the manipulator forces the discrete tokens to have the same modular arithmetic relation with the watermark message, as indicated by the red dashed line. For instance, if we intend to conceal the value "1" into the last discrete token, the manipulator will selectively sample from the odd tokens (highlighted in green) according to their probability distribution. The original token will then be replaced with the sampled token that has the highest probability (the 5th token). In watermark extraction, the localizer is responsible for watermark localization, while the restorer focuses on recovering the watermark message.
  • Figure 3: Visualizations of the ground-truth and watermarked mel-spectrograms produced by different speech watermarking methods. For a fair comparison, we directly download the example from WavMark's demo page and use Chang Liu's pre-trained model.
  • Figure 4: The tradeoff between reliability and imperceptibility on the AI-generated content detection task. "Z-statistic = 4.0" is shown as the red dashed line.
  • Figure 5: The structure of the VQ encoder, the masked decoder, and the manipulator.
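
The utterance-level detection mentioned in the TL;DR and the "Z-statistic = 4.0" line in Figure 4 can be sketched as follows. This excerpt does not give the test statistic itself, so the sketch assumes the standard one-proportion z-test: under the null hypothesis of unwatermarked speech, each token ID is assumed to match the expected residue with probability 1/modulus. The function names `z_statistic` and `is_watermarked` are hypothetical; only the threshold value 4.0 is taken from the Figure 4 caption.

```python
import math
from typing import Sequence


def z_statistic(token_ids: Sequence[int],
                expected_bits: Sequence[int],
                modulus: int = 2) -> float:
    """One-proportion z-score for the number of residue matches.

    Assumption: without a watermark, each token matches its expected
    residue independently with probability p0 = 1 / modulus.
    """
    n = len(token_ids)
    if n == 0 or n != len(expected_bits):
        raise ValueError("need equally long, non-empty sequences")
    p0 = 1.0 / modulus
    matches = sum(1 for t, b in zip(token_ids, expected_bits)
                  if t % modulus == b)
    return (matches - n * p0) / math.sqrt(n * p0 * (1.0 - p0))


def is_watermarked(token_ids: Sequence[int],
                   expected_bits: Sequence[int],
                   threshold: float = 4.0,
                   modulus: int = 2) -> bool:
    """Flag the utterance when the z-score exceeds the threshold."""
    return z_statistic(token_ids, expected_bits, modulus) > threshold


# 100 tokens that all carry the expected parity: 100 matches out of 100.
z_marked = z_statistic([1] * 100, [1] * 100)
# 100 tokens with alternating parity against all-zero bits: 50 matches.
z_clean = z_statistic([0, 1] * 50, [0] * 100)
```

A higher threshold lowers the false-positive rate but requires more tokens to match, which corresponds to the reliability side of the tradeoff plotted in Figure 4.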