Detecting Generative Parroting through Overfitting Masked Autoencoders

Saeid Asgari Taghanaki; Joseph Lambourne

TL;DR

This work tackles the challenge of generative parroting by training an overfitted Masked Autoencoder (MAE) and using reconstruction loss as a per-sample parrot-detection signal. A simple threshold $\tau = L_{\text{train}}$ on the MAE loss distinguishes parroted from novel content, enabling scalable, real-time detection on CAD sketches from the SketchGraphs dataset. Experiments show that longer training improves detection for seen and modified samples but can increase false positives on novel content; the MAE-based detector also outperforms Weisfeiler-Lehman graph-hash benchmarks under various binning schemes. The approach offers a practical, scalable path to copyright-aware content generation and invites extension to other data modalities and adaptive thresholding aligned with evolving legal standards.

Abstract

The advent of generative AI models has revolutionized digital content creation, yet it introduces challenges in maintaining copyright integrity due to generative parroting, where models mimic their training data too closely. Our research presents a novel approach to tackle this issue by employing an overfitted Masked Autoencoder (MAE) to detect such parroted samples effectively. We establish a detection threshold based on the mean loss across the training dataset, allowing for the precise identification of parroted content in modified datasets. Preliminary evaluations demonstrate promising results, suggesting our method's potential to ensure ethical use and enhance the legal compliance of generative models.
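The thresholding step described above can be illustrated with a minimal sketch. It assumes per-sample MAE reconstruction losses have already been computed elsewhere; the function names, the loss values, and the direction of the comparison (a loss at or below the threshold is read as "parroted", since an overfitted model reconstructs memorized samples well) are illustrative assumptions, not the authors' implementation.

```python
from statistics import mean


def mean_loss_threshold(train_losses):
    """Return the threshold tau as the mean per-sample loss on the training set."""
    return mean(train_losses)


def flag_parrots(candidate_losses, tau):
    """Flag a candidate as parroted when its loss is at or below tau (assumed direction)."""
    return [loss <= tau for loss in candidate_losses]


# Hypothetical per-sample MAE losses; illustrative numbers, not results from the paper.
train_losses = [0.02, 0.04, 0.03, 0.03]
candidate_losses = [0.01, 0.025, 0.20]

tau = mean_loss_threshold(train_losses)
flags = flag_parrots(candidate_losses, tau)
print(f"tau={tau:.3f}", flags)
```

In this toy setup the first two candidates fall below the mean training loss and are flagged, while the third, with a much higher loss, is treated as novel.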

Paper Structure (10 sections, 2 equations, 1 figure, 1 table)

Introduction
Related Work
Methodology
Dataset
Masked Autoencoder (MAE) Loss
Overfitting and Threshold Setting
Experiments
Benchmark
Discussion
Conclusion

Figures (1)

Figure 1: Representative samples from the datasets: $D_{\text{train}}$ (original training set), $D_{\text{var 1}}$ (first variation), and $D_{\text{var 2}}$ (second variation), shown from the first to third rows, respectively
