Mitigating data replication in text-to-audio generative diffusion models through anti-memorization guidance
Francisco Messina, Francesca Ronchini, Luca Comanducci, Paolo Bestagini, Fabio Antonacci
TL;DR
This work tackles data memorization in diffusion-based text-to-audio models with Anti-Memorization Guidance (AMG), an inference-time framework that steers samples away from memorized training content. AMG combines three complementary guidance strategies (despecification, caption deduplication, and dissimilarity) to suppress memorization while preserving semantic alignment and audio quality, and is demonstrated on the open-source Stable Audio Open model. Experiments show significant reductions in similarity to training data, with ablations highlighting dissimilarity guidance as particularly effective. Audio quality metrics such as FAD also improve overall, which challenges the assumption that mitigation harms fidelity. The approach offers a practical way to reduce copyright-related risks in generative audio systems and could be extended to other modalities and to training-time safeguards.
Abstract
A persistent challenge in generative audio models is data replication, where the model unintentionally generates parts of its training data during inference. In this work, we address this issue in text-to-audio diffusion models by exploring the use of anti-memorization strategies. We adopt Anti-Memorization Guidance (AMG), a technique that modifies the sampling process of pre-trained diffusion models to discourage memorization. Our study explores three types of guidance within AMG, each designed to reduce replication while preserving generation quality. We use Stable Audio Open as our backbone, leveraging its fully open-source architecture and training dataset. Our comprehensive experimental analysis suggests that AMG significantly mitigates memorization in diffusion-based text-to-audio generation without compromising audio fidelity or semantic alignment.
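
To make the idea of modifying the sampling process concrete, the following is a minimal sketch of how three guidance terms could be added to a classifier-free-guided noise prediction at one denoising step. It is an illustration under stated assumptions, not the authors' implementation: the function names, the scales `s_spe`, `s_dup`, and `c_sim`, the fixed similarity `threshold`, and the use of cosine similarity on raw latent vectors (rather than on learned audio embeddings) are all hypothetical choices made for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def cosine_similarity_grad(a, b):
    """Gradient of cos(a, b) with respect to a."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    cos = dot(a, b) / (na * nb)
    return [bi / (na * nb) - cos * ai / (na * na) for ai, bi in zip(a, b)]


def predict_x0(x_t, eps, alpha_bar):
    """Standard DDPM estimate of the clean sample from a noise prediction."""
    s, r = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - r * e) / s for x, e in zip(x_t, eps)]


def amg_guided_eps(x_t, eps_cond, eps_uncond, eps_neighbor_caption, neighbor,
                   alpha_bar, cfg_scale=3.0, threshold=0.5,
                   s_spe=1.0, s_dup=1.0, c_sim=5.0):
    """Hypothetical single-step guidance; all scales are illustrative."""
    # Ordinary classifier-free guidance.
    eps_cfg = [u + cfg_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

    # Only intervene when the current estimate is close to a training neighbor.
    x0_hat = predict_x0(x_t, eps_cfg, alpha_bar)
    if cosine_similarity(x0_hat, neighbor) <= threshold:
        return eps_cfg

    grad = cosine_similarity_grad(x0_hat, neighbor)
    r = math.sqrt(1.0 - alpha_bar)
    guided = []
    for c, u, n, e, g in zip(eps_cond, eps_uncond,
                             eps_neighbor_caption, eps_cfg, grad):
        g_spe = -s_spe * (c - u)  # despecification: weaken prompt conditioning
        g_dup = -s_dup * (n - u)  # caption deduplication: neighbor caption as negative prompt
        g_sim = c_sim * r * g     # dissimilarity: push the estimate away from the neighbor
        guided.append(e + g_spe + g_dup + g_sim)
    return guided
```

In this sketch, adding the similarity gradient to the noise prediction moves the implied clean-sample estimate in the opposite direction, which lowers its cosine similarity to the neighbor. When the similarity stays at or below the threshold, the step reduces to plain classifier-free guidance, so non-memorized generations are left untouched.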
