GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis
Weizhi Liu, Yue Li, Dongdong Lin, Hui Tian, Haizhou Li
TL;DR
Groot is a proactive watermarking approach for diffusion-model-based audio synthesis: the watermark is embedded during generation and later extracted, using a dedicated encoder and a lightweight decoder trained around a parameter-fixed diffusion model (DM). The method achieves high fidelity and scalable capacity (up to $5000$ bps) while remaining robust against both individual and compound post-processing attacks, with an average extraction accuracy of around $95\%$ under compound attacks. Key contributions include a plug-and-play watermarking paradigm for diffusion-based audio, a joint optimization framework that preserves audio quality, and a formal watermark verification procedure based on a binomial test. The results indicate Groot's practical potential for regulating synthesized audio and tracing its source models in real-world deployment.
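To illustrate the binomial-test verification mentioned above, the following is a minimal Python sketch, not the authors' implementation. It assumes a null hypothesis under which the audio carries no watermark, so each extracted bit matches the reference bit with probability 0.5. The function names (`match_count`, `binomial_p_value`, `is_watermarked`) and the significance level `alpha=1e-3` are hypothetical choices made for this example.

```python
from math import comb


def match_count(extracted, reference):
    """Count positions where the extracted bits equal the reference watermark bits."""
    if len(extracted) != len(reference):
        raise ValueError("bit sequences must have the same length")
    return sum(e == r for e, r in zip(extracted, reference))


def binomial_p_value(k, n, p=0.5):
    """One-sided p-value P(X >= k) for X ~ Binomial(n, p).

    Assumption: under the null hypothesis (no watermark), each bit
    matches the reference by chance with probability p = 0.5.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_watermarked(extracted, reference, alpha=1e-3):
    """Declare the watermark present when the match count is unlikely under chance.

    alpha is an illustrative significance level, not a value from the paper.
    """
    n = len(reference)
    k = match_count(extracted, reference)
    return binomial_p_value(k, n) < alpha


if __name__ == "__main__":
    reference = [1, 0] * 50                 # hypothetical 100-bit watermark
    noisy = list(reference)
    for idx in range(5):                    # flip 5 bits: 95 of 100 still match
        noisy[idx] ^= 1
    print(is_watermarked(noisy, reference))
```

Under these assumptions, a higher match count gives a smaller p-value, and the threshold `alpha` bounds the rate at which unwatermarked audio would be falsely flagged.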
Abstract
With the rapid development of generative models such as diffusion models, distinguishing synthesized audio from natural audio is becoming increasingly difficult. Deepfake detection offers one way to address this challenge, yet as a defensive measure it unintentionally fuels the continued refinement of generative models. Watermarking is a proactive and sustainable alternative that regulates the creation and dissemination of synthesized content in advance. This paper therefore proposes a pioneering generative robust audio watermarking method (Groot), which presents a paradigm for proactively supervising synthesized audio and its source diffusion models. In this paradigm, watermark generation and audio synthesis occur simultaneously, facilitated by parameter-fixed diffusion models equipped with a dedicated encoder. The watermark embedded in the audio can subsequently be retrieved by a lightweight decoder. Experimental results show that Groot performs strongly, particularly in robustness, where it surpasses state-of-the-art methods. Beyond its resilience against individual post-processing attacks, Groot remains robust under compound attacks, maintaining an average watermark extraction accuracy of around 95%.
