Multi-bit Audio Watermarking

Luca A. Lanzendörfer, Kyle Fearne, Florian Grötschla, Roger Wattenhofer

TL;DR

Timbru addresses robust, imperceptible audio watermarking for 44.1 kHz stereo content without training an embedder-detector model. It uses post-hoc, per-audio gradient optimization to perturb the latent representation of a pretrained Stable Audio Open VAE, encoding a multi-bit watermark that a CLAP-based extractor can detect. The method combines a hinge-based message loss with a perceptual loss and optimizes against simulated attacks, which promotes robustness while preserving perceptual quality. It outperforms prior methods on average bit error rate and remains resilient to unseen regeneration attacks. The approach offers a flexible, dataset-free alternative for protecting existing content and enabling provenance verification without bespoke training pipelines.

Abstract

We present Timbru, a post-hoc audio watermarking model that achieves state-of-the-art robustness and imperceptibility trade-offs without training an embedder-detector model. Given any 44.1 kHz stereo music snippet, our method performs per-audio gradient optimization to add imperceptible perturbations in the latent space of a pretrained audio VAE, guided by a combined message and perceptual loss. The watermark can then be extracted using a pretrained CLAP model. We evaluate 16-bit watermarking on MUSDB18-HQ against AudioSeal, WavMark, and SilentCipher across common filtering, noise, compression, resampling, cropping, and regeneration attacks. Our approach attains the best average bit error rates, while preserving perceptual quality, demonstrating an efficient, dataset-free path to imperceptible audio watermarking.
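The per-audio optimization described above can be illustrated with a toy sketch in pure Python. Everything in it is a stand-in chosen for illustration, not the authors' implementation: a fixed random linear map `W` plays the role of the frozen decoder plus CLAP extractor, a squared L2 penalty on the perturbation replaces the perceptual loss, and additive Gaussian noise replaces the simulated attacks. All names and hyperparameters (`embed`, `margin`, `lam`, `noise_std`) are hypothetical.

```python
import random

def extract_logits(W, latent):
    """Toy frozen extractor (assumption): one linear score per payload bit."""
    return [sum(w * x for w, x in zip(row, latent)) for row in W]

def bit_error_rate(logits, bits):
    """Fraction of payload bits decoded incorrectly (logit > 0 decodes to 1)."""
    wrong = sum((l > 0) != bool(b) for l, b in zip(logits, bits))
    return wrong / len(bits)

def embed(z, W, bits, steps=400, lr=0.1, margin=1.0, lam=0.01,
          noise_std=0.1, seed=0):
    """Optimize a perturbation `delta` on latent `z`; `W` stays frozen."""
    rng = random.Random(seed)
    signs = [1.0 if b else -1.0 for b in bits]
    delta = [0.0] * len(z)
    for _ in range(steps):
        # Simulated attack (assumption): Gaussian noise on the perturbed latent.
        attacked = [zi + di + rng.gauss(0.0, noise_std)
                    for zi, di in zip(z, delta)]
        logits = extract_logits(W, attacked)
        # Gradient of the stand-in perceptual term lam * ||delta||^2.
        grad = [2.0 * lam * di for di in delta]
        # Subgradient of the hinge message loss sum_i max(0, margin - s_i * logit_i).
        for s, l, row in zip(signs, logits, W):
            if margin - s * l > 0.0:  # hinge is active for this bit
                for j, w in enumerate(row):
                    grad[j] -= s * w
        # Plain gradient-descent update on the perturbation only.
        delta = [di - lr * gi for di, gi in zip(delta, grad)]
    return delta

# Toy setup: 32-dimensional latent, 16-bit payload.
rng = random.Random(1)
dim, n_bits = 32, 16
W = [[rng.gauss(0.0, 1.0) / dim ** 0.5 for _ in range(dim)]
     for _ in range(n_bits)]
z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
bits = [rng.randint(0, 1) for _ in range(n_bits)]

delta = embed(z, W, bits)
watermarked = [zi + di for zi, di in zip(z, delta)]
ber = bit_error_rate(extract_logits(W, watermarked), bits)
print("BER after embedding:", ber)
```

Only `delta` is updated, which mirrors the idea that the pretrained components remain frozen. In a real pipeline the extractor and perceptual loss are nonlinear networks, so the gradient would come from automatic differentiation rather than the hand-written subgradient used here.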

Paper Structure

This paper contains 3 sections, 5 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of our proposed approach. The raw waveform $A_R$ is first transformed into the latent representation using a pretrained Stable Audio Open VAE. To embed a watermark, minor perturbations are added to this intermediate representation. At every step, this representation is decoded back into a waveform ($A_W$) and then augmented to simulate a variety of attacks. The perceptual loss and the message loss from the decoded message are then used to calculate the gradient which optimizes the perturbations. All other components remain frozen.
  • Figure 2: (Left) Mean bit recovery rate ($\mathrm{BRR}=(1-\mathrm{BER})\times 100$) for a 16-bit payload over optimization steps, showing that the longer Timbru runs, the more robust the embedded watermark becomes. (Right) Ablation in which each point represents the mean BRR for watermarked audio with a specific payload length, showing how the mean BRR and the perceptual quality change as the payload length increases.
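
The metrics named in the Figure 2 caption can be made concrete with a small helper. This sketch assumes BER is the fraction of payload bits that differ between the embedded and the extracted message, and that BRR is its complement (multiply by 100 to report a percentage). The function names are hypothetical and are not taken from the paper.

```python
def bit_error_rate(sent, received):
    """Fraction of positions where the extracted bits differ from the payload."""
    if len(sent) != len(received):
        raise ValueError("payloads must have the same length")
    errors = sum(a != b for a, b in zip(sent, received))
    return errors / len(sent)

def bit_recovery_rate(sent, received):
    """Complement of BER, as a fraction in [0, 1]."""
    return 1.0 - bit_error_rate(sent, received)

# A 16-bit payload with two bits flipped during extraction.
sent = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
received = list(sent)
received[3] ^= 1
received[10] ^= 1

ber_example = bit_error_rate(sent, received)
brr_example = bit_recovery_rate(sent, received)
print(ber_example, brr_example)
```

Averaging `bit_recovery_rate` over many clips gives the kind of mean BRR plotted in the figure.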