Multi-bit Audio Watermarking
Luca A. Lanzendörfer, Kyle Fearne, Florian Grötschla, Roger Wattenhofer
TL;DR
Timbru tackles robust, imperceptible watermarking of 44.1 kHz stereo audio without training an embedder-detector model. Instead, it runs post-hoc, per-audio gradient optimization that perturbs latent representations within a pretrained Open VAE, encoding a multi-bit watermark that a CLAP-based extractor can detect. The method combines a hinge-based message loss with a perceptual loss and optimizes against simulated attacks to promote robustness while preserving perceptual quality. It outperforms prior methods on average bit error rate and is resilient to unseen regeneration attacks. The approach offers a flexible, dataset-free alternative for protecting existing content and verifying provenance without bespoke training pipelines.
Abstract
We present Timbru, a post-hoc audio watermarking model that achieves state-of-the-art robustness and imperceptibility trade-offs without training an embedder-detector model. Given any 44.1 kHz stereo music snippet, our method performs per-audio gradient optimization to add imperceptible perturbations in the latent space of a pretrained audio VAE, guided by a combined message and perceptual loss. The watermark can then be extracted using a pretrained CLAP model. We evaluate 16-bit watermarking on MUSDB18-HQ against AudioSeal, WavMark, and SilentCipher across common filtering, noise, compression, resampling, cropping, and regeneration attacks. Our approach attains the lowest average bit error rate while preserving perceptual quality, demonstrating an efficient, dataset-free path to imperceptible audio watermarking.
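To make the two quantities named above concrete, the following is a minimal sketch of a hinge-style message loss and of the bit error rate used for evaluation. It is an illustration under stated assumptions, not the paper's implementation: the function names, the margin of 1.0, the zero decision threshold, and the idea that the extractor yields one real-valued score per bit are all hypothetical, since the text here does not specify how CLAP outputs are mapped to bits or how the loss is parameterized.

```python
from typing import Sequence


def hinge_message_loss(
    scores: Sequence[float], bits: Sequence[int], margin: float = 1.0
) -> float:
    """Mean hinge loss over the message bits.

    Assumption: `scores` holds one real-valued extractor output per bit,
    and bit 1 / bit 0 correspond to positive / negative scores.
    """
    if len(scores) != len(bits):
        raise ValueError("scores and bits must have the same length")
    total = 0.0
    for score, bit in zip(scores, bits):
        sign = 1.0 if bit == 1 else -1.0
        # Zero loss once the score is on the correct side by at least `margin`.
        total += max(0.0, margin - sign * score)
    return total / len(bits)


def decode_bits(scores: Sequence[float]) -> list[int]:
    """Threshold each score at zero to recover a hard bit (assumed rule)."""
    return [1 if score > 0 else 0 for score in scores]


def bit_error_rate(decoded: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of bit positions where the decoded and target messages differ."""
    if len(decoded) != len(target):
        raise ValueError("decoded and target must have the same length")
    errors = sum(d != t for d, t in zip(decoded, target))
    return errors / len(target)


if __name__ == "__main__":
    # A hypothetical 16-bit message, matching the payload size evaluated above.
    message = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    # Scores that agree with the message everywhere except the first bit.
    scores = [2.0 if b == 1 else -2.0 for b in message]
    scores[0] = -2.0
    print(hinge_message_loss(scores, message))
    print(bit_error_rate(decode_bits(scores), message))
```

In a per-audio optimization of the kind described, a loss like this would be computed on the extractor's output for the perturbed audio and combined with a perceptual term; the weighting between the two is not given here and is left out of the sketch.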
