SOLIDO: A Robust Watermarking Method for Speech Synthesis via Low-Rank Adaptation

Yue Li, Weizhi Liu, Dongdong Lin

TL;DR

SOLIDO addresses the need for efficient and robust watermarking in diffusion-based speech synthesis by integrating parameter-efficient fine-tuning via Low-Rank Adaptation with a watermark encoder/decoder. The method employs a three-phase pipeline plus an attack simulator and a speech-driven lightweight fine-tuning strategy to maintain speech fidelity while enabling high-accuracy watermark extraction, even at a large capacity of 2000 bps. Extensive experiments show SOLIDO achieves near-perfect extraction accuracy under varied individual and compound attacks, and handles variable-length inputs that challenge many baselines, offering strong model-protection and content-authentication capabilities. The practical impact lies in providing a scalable, low-overhead solution for copyright protection and content provenance in AI-generated speech systems that rely on diffusion models.

Abstract

The accelerated advancement of speech generative models has given rise to security issues, including model infringement and unauthorized abuse of content. Although existing generative watermarking techniques offer solutions, most require substantial computational overhead and training costs. In addition, some methods lack robustness when handling variable-length inputs. To tackle these challenges, we propose SOLIDO, a novel generative watermarking method that integrates parameter-efficient fine-tuning with speech watermarking through low-rank adaptation (LoRA) for speech diffusion models. Concretely, the watermark encoder converts the watermark to align with the input of diffusion models. To achieve precise watermark extraction from variable-length inputs, a watermark decoder based on depthwise separable convolution is designed for watermark recovery. To further enhance speech generation performance and watermark extraction capability, we propose a speech-driven lightweight fine-tuning strategy, which reduces computational overhead through LoRA. Comprehensive experiments demonstrate that the proposed method ensures high-fidelity watermarked speech even at a large capacity of 2000 bps. Furthermore, against common individual and compound speech attacks, SOLIDO achieves a maximum average extraction accuracy of 99.20% and 98.43%, respectively. It surpasses other state-of-the-art methods by nearly 23% in resisting time-stretching attacks.
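To make the parameter saving behind LoRA concrete, the following is a minimal sketch of the standard low-rank update W = W0 + (alpha / r) * B A, in which the pretrained weight W0 stays frozen and only the two small factors B and A are trained. It is not SOLIDO's implementation: the function names, the 512 x 512 layer size, and the rank of 4 are assumptions chosen purely for illustration.

```python
# Illustrative sketch of the standard LoRA update, not SOLIDO's actual code.
# All names and sizes below are hypothetical.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_weight(w0, b, a, alpha, rank):
    """Return W = W0 + (alpha / rank) * B @ A, with W0 kept frozen."""
    delta = matmul(b, a)          # (d_out x rank) @ (rank x d_in)
    scale = alpha / rank
    return [[w0[i][j] + scale * delta[i][j] for j in range(len(w0[0]))]
            for i in range(len(w0))]

def param_counts(d_in, d_out, rank):
    """Trainable parameters: full fine-tuning vs. the two LoRA factors."""
    return d_in * d_out, rank * (d_in + d_out)

w0 = [[1.0, 2.0], [3.0, 4.0]]    # frozen 2x2 weight
b = [[0.0], [0.0]]               # B starts at zero, so training begins at W0
a = [[0.5, -0.5]]                # A is small and trainable
assert lora_weight(w0, b, a, alpha=1.0, rank=1) == w0

full, lora = param_counts(d_in=512, d_out=512, rank=4)
print(full, lora)  # 262144 4096
```

For this hypothetical layer, the low-rank factors hold 64 times fewer trainable parameters than the full weight matrix, which is the kind of reduction the abstract attributes to LoRA.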

Paper Structure

This paper contains 32 sections, 17 equations, 5 figures, 10 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of different watermarking techniques with full-parameter and parameter-efficient fine-tuning strategies. The proposed method leverages low-rank adaptation to enable single-step model training while drastically reducing the number of trainable parameters.
  • Figure 2: The pipeline of the proposed SOLIDO. In the watermarking phase, the watermark encoder encodes the watermark into a latent variable, which is then combined with the original Gaussian-sampled input to obtain the modified input. In the generating phase, the diffusion model with LoRA takes this modified input and generates the watermarked speech. In the watermark extracting phase, the watermarked speech first passes through the attack simulator and is then fed into the watermark decoder to recover the watermark.
  • Figure 3: Architecture of SOLIDO.
  • Figure 4: Capacity analysis across different PEFT methods.
  • Figure 5: Robustness against various rates of rear-segment cropping attacks.
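
The rear-segment cropping attack in Figure 5 and the extraction accuracy reported in the abstract can be illustrated with a small sketch. It assumes that cropping at a given rate removes that fraction of samples from the end of the clip, and that extraction accuracy is the fraction of watermark bits recovered correctly. Both helper functions are hypothetical and are not taken from the paper's attack simulator or evaluation code.

```python
# Hypothetical helpers for illustration only; the paper's own attack
# simulator and metric code are not shown in this summary.

def crop_rear(samples, rate):
    """Drop the trailing `rate` fraction of the samples (0 <= rate < 1)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = len(samples) - int(len(samples) * rate)
    return samples[:keep]

def bit_accuracy(embedded, recovered):
    """Fraction of watermark bits that match between the two sequences."""
    if not embedded or len(embedded) != len(recovered):
        raise ValueError("bit sequences must be non-empty and equal in length")
    matches = sum(e == r for e, r in zip(embedded, recovered))
    return matches / len(embedded)

speech = list(range(8))                  # stand-in for 8 audio samples
print(crop_rear(speech, 0.25))           # [0, 1, 2, 3, 4, 5]
print(bit_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

Because cropping changes the length of the signal, a decoder evaluated this way has to accept variable-length input, which is the setting the abstract highlights.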