Detecting Generative Parroting through Overfitting Masked Autoencoders
Saeid Asgari Taghanaki, Joseph Lambourne
TL;DR
This work tackles the challenge of generative parroting by training an overfitted Masked Autoencoder (MAE) and using reconstruction loss as a per-sample parrot-detection signal. A simple threshold $\tau = L_{\text{train}}$ on the MAE loss distinguishes parroted from novel content, enabling scalable, real-time detection on CAD sketches from the SketchGraphs dataset. Experiments show that longer training improves detection for seen and modified samples but can increase false positives on novel content; the MAE-based detector also outperforms Weisfeiler-Lehman graph-hash benchmarks under various binning schemes. The approach offers a practical, scalable path to copyright-aware content generation and invites extension to other data modalities and adaptive thresholding aligned with evolving legal standards.
Abstract
The advent of generative AI models has revolutionized digital content creation, yet it introduces challenges in maintaining copyright integrity due to generative parroting, where models mimic their training data too closely. Our research presents a novel approach to tackle this issue by employing an overfitted Masked Autoencoder (MAE) to detect such parroted samples effectively. We establish a detection threshold based on the mean loss across the training dataset, allowing for the precise identification of parroted content in modified datasets. Preliminary evaluations demonstrate promising results, suggesting our method's potential to ensure ethical use and enhance the legal compliance of generative models.
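The thresholding rule described above can be illustrated with a minimal sketch. It assumes per-sample MAE reconstruction losses have already been computed; the function names and the loss values are hypothetical and are not taken from the paper. It also assumes that a loss at or below the threshold marks a sample as parroted, on the reasoning that an overfitted MAE reconstructs samples close to its training data with low loss.

```python
from statistics import mean


def parrot_threshold(train_losses):
    """Return tau, taken here as the mean per-sample loss over the training set."""
    return mean(train_losses)


def flag_parrots(sample_losses, tau):
    """Flag a sample as parroted when its loss is at or below tau.

    The comparison direction is an assumption: low reconstruction loss is
    read as "close to something the overfitted MAE has memorized".
    """
    return [loss <= tau for loss in sample_losses]


# Hypothetical per-sample reconstruction losses, for illustration only.
train_losses = [0.02, 0.04, 0.03, 0.03]
tau = parrot_threshold(train_losses)

# Losses of generated samples to screen (also hypothetical).
candidate_losses = [0.01, 0.02, 0.25]
flags = flag_parrots(candidate_losses, tau)

for loss, is_parrot in zip(candidate_losses, flags):
    print(f"loss={loss:.2f} parroted={is_parrot}")
```

Because the threshold is a single scalar computed once from the training set, screening a new sample costs one forward pass and one comparison, which is consistent with the scalable, per-sample detection the TL;DR describes.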
