Smark: A Watermark for Text-to-Speech Diffusion Models via Discrete Wavelet Transform

Yichuan Zhang, Chengxin Li, Yujie Gu

TL;DR

This paper tackles intellectual property protection and usage tracing for high-quality text-to-speech (TTS) diffusion models. It introduces Smark, a universal watermarking framework that embeds watermarks during the shared reverse-diffusion process by applying the discrete wavelet transform (DWT) to Mel spectrograms and inserting the watermark into the low-frequency LL sub-band. A lightweight embedder and extractor operate within this LL channel, and a joint optimization scheme balances perceptual audio fidelity against watermark extraction accuracy; the approach is validated across multiple TTS diffusion models and datasets. Ablation studies and hypothesis-test-based watermark verification show robust performance against realistic attacks, positioning Smark as a practical, model-agnostic solution for speech authentication and copyright protection.

Abstract

Text-to-Speech (TTS) diffusion models generate high-quality speech, which raises challenges for protecting model intellectual property and tracing speech for legal use. Audio watermarking is a promising solution. However, due to the structural differences among various TTS diffusion models, existing watermarking methods are often designed for a specific model and degrade audio quality, which limits their practical applicability. To address this dilemma, this paper proposes a universal watermarking scheme for TTS diffusion models, termed Smark. This is achieved by designing a lightweight watermark embedding framework that operates in the common reverse diffusion paradigm shared by all TTS diffusion models. To mitigate the impact on audio quality, Smark utilizes the discrete wavelet transform (DWT) to embed watermarks into the relatively stable low-frequency regions of the audio, which ensures seamless watermark-audio integration and makes the watermark resistant to removal during the reverse diffusion process. Extensive experiments are conducted to evaluate the audio quality and watermark performance in various simulated real-world attack scenarios. The experimental results show that Smark achieves superior performance in both audio quality and watermark extraction accuracy.
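The DWT-based embedding idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single-level Haar wavelet, a toy 2D list standing in for a Mel spectrogram, and a fixed additive perturbation of strength `alpha` in place of the learned embedder. The function names `haar_dwt2`, `haar_idwt2`, and `embed_in_ll` are hypothetical, and the LH/HL naming follows one common convention.

```python
from itertools import cycle


def haar_dwt2(x):
    """Single-level 2D Haar DWT of a matrix with even dimensions.

    Returns the four sub-bands (LL, LH, HL, HH), each half the size of x.
    """
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, len(x), 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for j in range(0, len(x[0]), 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2)  # low-pass in both directions
            r_lh.append((a + b - c - d) / 2)  # difference between rows
            r_hl.append((a - b + c - d) / 2)  # difference between columns
            r_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the full-size matrix from the sub-bands."""
    rows, cols = len(ll), len(ll[0])
    x = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for i in range(rows):
        for j in range(cols):
            s, v, h, d = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (s + v + h + d) / 2
            x[2 * i][2 * j + 1] = (s + v - h - d) / 2
            x[2 * i + 1][2 * j] = (s - v + h - d) / 2
            x[2 * i + 1][2 * j + 1] = (s - v - h + d) / 2
    return x


def embed_in_ll(x, bits, alpha=0.1):
    """Toy stand-in for a learned embedder (an assumption, not the paper's).

    Adds +alpha (bit 1) or -alpha (bit 0) to each LL coefficient, cycling
    through the bits, and leaves LH, HL, HH untouched before the inverse DWT.
    """
    ll, lh, hl, hh = haar_dwt2(x)
    signs = cycle([2 * b - 1 for b in bits])
    ll_marked = [[v + alpha * next(signs) for v in row] for row in ll]
    return haar_idwt2(ll_marked, lh, hl, hh)


if __name__ == "__main__":
    spectrogram = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    marked = embed_in_ll(spectrogram, bits=[1, 0, 1, 1], alpha=0.1)
    for row in marked:
        print([round(v, 3) for v in row])
```

Because only the LL sub-band is changed, re-decomposing the output returns the original LH, HL, and HH coefficients, which is the property that keeps the high-frequency detail of the spectrogram intact in this sketch.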

Paper Structure

This paper contains 28 sections, 8 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The pipeline of Smark. First, during the reverse diffusion process of a TTS diffusion model, the Mel spectrogram $X'_t$ at timestep $t$ is decomposed into four sub-bands (LL, LH, HL, HH) using DWT. Next, the watermark $\mathbf{m}$ is embedded into the LL sub-band of $X'_t$ via a watermark embedder (see Fig. 2). Then, the watermarked LL sub-band is combined with the unmodified LH, HL, and HH sub-bands to reconstruct the watermarked Mel spectrogram $\tilde{X}'_t$ via the inverse transform (IDWT), which is passed to the next reverse diffusion step. After the diffusion process is complete, the watermark $\mathbf{m}'$ is extracted using a learned extractor and compared to the original $\mathbf{m}$ for verification, under both the no-attack scenario and various simulated real-world attack scenarios.
  • Figure 2: Watermark embedder and extractor
  • Figure 3: A test result for the binomial distribution under hypotheses $H_0$ and $H_1$, with watermark length $N=100$ and sample size 1000. For $H_0$, the bit-wise accuracy is $\hat{\xi}\approx 0.5331$ (i.e., the average of the gray bars); for $H_1$, $\hat{\xi}\approx 0.9983$ (i.e., the average of the blue bars). A verification threshold $\tau \in [0.62, 0.97]$ ensures both low FPR and low FNR.
  • Figure 4: Fidelity of Smark across capacities on LJSpeech for GradTTS and WaveGrad (no attack)
  • Figure 5: Comparison of Smark's performance under composite attacks across different capacities, on GradTTS (first row), WaveGrad (second row), and PriorGrad (third row) using the LJSpeech dataset.
  • ...and 1 more figures
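
The binomial verification test described in the Figure 3 caption can be sketched as follows. This is an illustrative calculation under stated assumptions, not the paper's procedure: it treats the $N$ extracted bits as independent, models the number of matching bits as Binomial($N$, $p$), and uses an idealized $p = 0.5$ under $H_0$ (the caption reports an empirical value of about 0.5331). The helper names `binom_tail` and `error_rates` are hypothetical, and the threshold is passed as an integer count of matching bits (for example, 62 of 100 for $\tau = 0.62$).

```python
from math import comb


def binom_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def error_rates(n, k, p0=0.5, p1=0.9983):
    """False-positive and false-negative rates for a match-count threshold k.

    A watermark is declared present when at least k of the n bits match.
    p0 is the assumed per-bit match probability without a watermark (H0);
    p1 is the assumed per-bit match probability with a watermark (H1).
    """
    fpr = binom_tail(n, k, p0)        # H0 true, yet k or more bits match
    fnr = 1.0 - binom_tail(n, k, p1)  # H1 true, yet fewer than k bits match
    return fpr, fnr


if __name__ == "__main__":
    for k in (62, 80, 97):
        fpr, fnr = error_rates(100, k)
        print(f"k={k}: FPR={fpr:.3e}, FNR={fnr:.3e}")
```

Raising the threshold lowers the false-positive rate and raises the false-negative rate, which is why a range of thresholds between the two accuracy distributions can keep both errors small.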