Proactive Detection of Voice Cloning with Localized Watermarking

Robin San Roman; Pierre Fernandez; Alexandre Défossez; Teddy Furon; Tuan Tran; Hady Elsahar

TL;DR

AudioSeal addresses the risk of voice cloning with a proactive, localized watermarking framework for AI-generated speech. A generator–detector pair is trained with a novel TF-Loudness perceptual loss and a masked sample-level localization objective, so that the embedded watermark is imperceptible yet detectable at the resolution of a single sample ($1/16k$ s); the watermark can optionally carry a multi-bit message for attribution. The method delivers state-of-the-art robustness to common audio edits, localization down to single samples, and detection up to two orders of magnitude faster than prior watermarking approaches, which makes it suitable for real-time, large-scale deployment. A security analysis shows that keeping the detector weights private is important for resisting adversarial watermark removal, a practical consideration when deploying AI content provenance APIs.

Abstract

In the rapidly evolving field of speech generative models, there is a pressing need to ensure audio authenticity against the risks of voice cloning. We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech. AudioSeal employs a generator/detector architecture trained jointly with a localization loss, which enables localized watermark detection up to the sample level, and a novel perceptual loss inspired by auditory masking, which improves imperceptibility. AudioSeal achieves state-of-the-art performance in terms of robustness to real-life audio manipulations and imperceptibility, based on automatic and human evaluation metrics. Additionally, AudioSeal is designed with a fast, single-pass detector that significantly surpasses existing models in speed, achieving detection up to two orders of magnitude faster, which makes it ideal for large-scale and real-time applications.
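To make sample-level localization concrete, the following is a minimal sketch of how per-sample detector scores could be turned into time spans. It is not AudioSeal's implementation: the function name `watermarked_segments`, the 0.5 threshold, and the toy score values are assumptions for illustration, and the 16 kHz sampling rate is inferred from the stated 1/16k-second resolution. The toy signal mirrors the 5 to 7.5 second example described in the caption of Figure 3.

```python
from typing import List, Tuple

# Assumed sampling rate, inferred from the "1/16k seconds" resolution in the text.
SAMPLE_RATE = 16_000


def watermarked_segments(
    probs: List[float], threshold: float = 0.5
) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans whose per-sample score exceeds the threshold.

    `probs` stands in for a detector's per-sample watermark probabilities;
    the threshold value is a hypothetical choice, not one taken from the paper.
    """
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first sample in the current flagged run
    for i, p in enumerate(probs):
        flagged = p > threshold
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            segments.append((start / SAMPLE_RATE, i / SAMPLE_RATE))
            start = None
    if start is not None:  # a flagged run that reaches the end of the audio
        segments.append((start / SAMPLE_RATE, len(probs) / SAMPLE_RATE))
    return segments


# Toy scores for 10 s of audio with a watermark between 5 s and 7.5 s.
probs = (
    [0.02] * (5 * SAMPLE_RATE)
    + [0.97] * int(2.5 * SAMPLE_RATE)
    + [0.02] * int(2.5 * SAMPLE_RATE)
)
print(watermarked_segments(probs))  # [(5.0, 7.5)]
```

In practice a detector's scores are noisy, so a real pipeline would likely smooth the scores or merge very short runs before reporting segments; that step is omitted here to keep the sketch short.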

Paper Structure (43 sections, 5 equations, 10 figures, 10 tables)

Introduction
Related Work
Synthetic speech detection.
Imperceptible watermarking.
Method
Training pipeline
Losses
Perceptual losses
Masked sample-level detection loss.
Multi-bit watermarking
Training details
Detection, localization and attribution
Audio/Speech Quality
Experiments and Evaluation
Comparison with passive classifier
...and 28 more sections

Figures (10)

Figure 1: Proactive detection of AI-generated speech. We embed an imperceptible watermark in the audio, which can be used to detect whether speech is AI-generated and to identify the model that generated it. It can also precisely pinpoint AI-generated segments in a longer audio with sample-level resolution (1/16k seconds).
Figure 2: Generator-detector training pipeline.
Figure 3: (Top) A speech signal (gray) where the watermark is present between 5 and 7.5 seconds (orange, magnified by 5). (Bottom) The output of the detector for every time step. An orange background color indicates the presence of the watermark.
Figure 4: Architectures. The generator is made of an encoder and a decoder both derived from EnCodec's design, with optional message embeddings. The encoder includes convolutional blocks and an LSTM, while the decoder mirrors this structure with transposed convolutions. The detector is made of an encoder and a transpose convolution, followed by a linear layer that calculates sample-wise logits. Optionally, multiple linear layers can be used for calculating k-bit messages. More details in App. \ref{app:arch}.
Figure 5: Localization results across different durations of watermarked audio signals in terms of Sample-Level Accuracy and Intersection Over Union (IoU) metrics ($\uparrow$ is better).
...and 5 more figures
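
The two localization metrics named in the Figure 5 caption can be illustrated with a short sketch. It assumes both the prediction and the ground truth are per-sample boolean masks of equal length; the function names and the convention of returning 1.0 when neither mask marks any sample are assumptions made here, and the paper's exact definitions may differ.

```python
from typing import Sequence


def sample_accuracy(pred: Sequence[bool], truth: Sequence[bool]) -> float:
    """Fraction of samples whose predicted label matches the ground truth."""
    if len(pred) != len(truth) or len(truth) == 0:
        raise ValueError("masks must be non-empty and of equal length")
    correct = sum(p == t for p, t in zip(pred, truth))
    return correct / len(truth)


def intersection_over_union(pred: Sequence[bool], truth: Sequence[bool]) -> float:
    """Samples flagged in both masks divided by samples flagged in either mask."""
    if len(pred) != len(truth):
        raise ValueError("masks must be of equal length")
    inter = sum(p and t for p, t in zip(pred, truth))
    union = sum(p or t for p, t in zip(pred, truth))
    # Assumed convention: two empty masks agree perfectly.
    return inter / union if union else 1.0


# Toy masks over 10 samples: truth covers samples 4-7, prediction covers 5-8.
truth = [4 <= i <= 7 for i in range(10)]
pred = [5 <= i <= 8 for i in range(10)]
print(sample_accuracy(pred, truth))          # 0.8
print(intersection_over_union(pred, truth))  # 0.6
```

The toy case shows why the two metrics are reported together: a prediction shifted by one sample still scores 0.8 on accuracy because most samples are unwatermarked in both masks, while IoU, which ignores those shared negatives, drops to 0.6.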
