Detecting Generative Parroting through Overfitting Masked Autoencoders

Saeid Asgari Taghanaki, Joseph Lambourne

TL;DR

This work tackles the challenge of generative parroting by training an overfitted Masked Autoencoder (MAE) and using reconstruction loss as a per-sample parrot-detection signal. A simple threshold $\tau = L_{\text{train}}$ on the MAE loss distinguishes parroted from novel content, enabling scalable, real-time detection on CAD sketches from the SketchGraphs dataset. Experiments show that longer training improves detection for seen and modified samples but can increase false positives on novel content; the MAE-based detector also outperforms Weisfeiler-Lehman graph-hash benchmarks under various binning schemes. The approach offers a practical, scalable path to copyright-aware content generation and invites extension to other data modalities and adaptive thresholding aligned with evolving legal standards.

Abstract

The advent of generative AI models has revolutionized digital content creation, yet it introduces challenges in maintaining copyright integrity due to generative parroting, where models mimic their training data too closely. Our research presents a novel approach to tackle this issue by employing an overfitted Masked Autoencoder (MAE) to detect such parroted samples effectively. We establish a detection threshold based on the mean loss across the training dataset, allowing for the precise identification of parroted content in modified datasets. Preliminary evaluations demonstrate promising results, suggesting our method's potential to ensure ethical use and enhance the legal compliance of generative models.
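
The thresholding rule described above can be illustrated with a minimal sketch. It assumes per-sample MAE reconstruction losses have already been computed elsewhere, and it assumes that a loss at or below the threshold marks a sample as parroted, since an overfitted model should reconstruct seen data well; the comparison direction, the function names, and the loss values are illustrative assumptions, not taken from the paper.

```python
from statistics import mean


def mean_train_threshold(train_losses):
    """Return tau, the mean per-sample reconstruction loss over the training set."""
    return mean(train_losses)


def flag_parroted(candidate_losses, tau):
    """Flag each candidate as parroted when its loss is at or below tau.

    Assumption: low loss under an overfitted MAE indicates a seen
    (or lightly modified) sample; the paper may use a different rule.
    """
    return [loss <= tau for loss in candidate_losses]


# Hypothetical per-sample losses, for illustration only.
train_losses = [0.02, 0.04, 0.03, 0.03]
tau = mean_train_threshold(train_losses)

# Two low-loss candidates (likely seen) and one high-loss candidate (likely novel).
candidate_losses = [0.01, 0.25, 0.02]
flags = flag_parroted(candidate_losses, tau)

for loss, is_parrot in zip(candidate_losses, flags):
    label = "parroted" if is_parrot else "novel"
    print(f"loss={loss:.2f} tau={tau:.3f} -> {label}")
```

Because the threshold is a single scalar computed once from the training set, scoring a new sample costs one forward pass and one comparison, which is consistent with the scalability claim in the TL;DR.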

Paper Structure (10 sections, 2 equations, 1 figure, 1 table)

Figures (1)

  • Figure 1: Representative samples from the datasets: $D_{\text{train}}$ (original training set), $D_{\text{var 1}}$ (first variation), and $D_{\text{var 2}}$ (second variation), shown from the first to third rows, respectively