Mitigating data replication in text-to-audio generative diffusion models through anti-memorization guidance

Francisco Messina, Francesca Ronchini, Luca Comanducci, Paolo Bestagini, Fabio Antonacci

TL;DR

This work tackles data memorization in diffusion-based text-to-audio models by adopting Anti-Memorization Guidance (AMG), an inference-time framework that steers samples away from memorized training content. AMG combines three complementary guidance strategies (despecification, caption deduplication, and dissimilarity) to suppress memorization while preserving semantic alignment and audio quality, and is demonstrated on the open-source Stable Audio Open model. Experiments show significant reductions in similarity to training data, and ablations highlight dissimilarity guidance as particularly effective. Audio quality metrics such as FAD improve overall, challenging the assumption that mitigation harms fidelity. The approach offers a practical path to reducing copyright-related risks in generative audio systems and paves the way for extensions to other modalities and training-time safeguards.

Abstract

A persistent challenge in generative audio models is data replication, where the model unintentionally generates parts of its training data during inference. In this work, we address this issue in text-to-audio diffusion models by exploring the use of anti-memorization strategies. We adopt Anti-Memorization Guidance (AMG), a technique that modifies the sampling process of pre-trained diffusion models to discourage memorization. Our study explores three types of guidance within AMG, each designed to reduce replication while preserving generation quality. We use Stable Audio Open as our backbone, leveraging its fully open-source architecture and training dataset. Our comprehensive experimental analysis suggests that AMG significantly mitigates memorization in diffusion-based text-to-audio generation without compromising audio fidelity or semantic alignment.

Paper Structure

This paper contains 16 sections, 13 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Example spectrograms of the original training audio track (a) and of audio generated from the same textual prompt without (b) and with (c) AMG.
  • Figure 2: Similarity matrices computed on the same audio track considered in Fig. 1, without (a) and with (b) AMG.
  • Figure 3: t-SNE visualization of embeddings from the considered dataset and from audio generated with AMG (Full AMG) and without AMG (Memorization), using CLAPlaion (a) and MERT (b) as embedding extractors.
  • Figure 4: Histogram of similarity score distributions computed over embeddings extracted via CLAPlaion (a) and MERT (b).