CINEMAE: Leveraging Frozen Masked Autoencoders for Cross-Generator AI Image Detection
Minsuk Jang, Hyeonseo Jeong, Minseok Son, Changick Kim
TL;DR
CINEMAE addresses the cross-generator generalization gap in AI-generated image detection by combining a global semantic signal from a frozen Masked Autoencoder (MAE) with a local contextual anomaly signal derived from context-aware reconstruction. The method defines a Local Contextual Anomaly Score that blends statistical deviation with reconstruction error, and fuses the two signals additively on top of the frozen MAE. On GenImage, CINEMAE attains a mean accuracy of $95.96\%$, with consistent performance across eight unseen generators, and shows robust zero-shot results on Chameleon with reduced bias. The work highlights the value of context-conditioned reconstruction uncertainty as a transferable authenticity cue and outlines avenues for combining this approach with frequency-based detectors for greater robustness.
Abstract
While context-based detectors have achieved strong generalization for AI-generated text by measuring distributional inconsistencies, image-based detectors still struggle with overfitting to generator-specific artifacts. We introduce CINEMAE, a novel paradigm for AI-generated content (AIGC) image detection that adapts the core principles of text detection methods to the visual domain. Our key insight is that a Masked Autoencoder (MAE), trained to reconstruct masked patches conditioned on visible context, naturally encodes expectations of semantic consistency. We formalize this reconstruction process probabilistically, computing the conditional negative log-likelihood (NLL), $-\log p(\text{masked} \mid \text{visible})$, to quantify local semantic anomalies. By aggregating these patch-level statistics with global MAE features through learned fusion, CINEMAE achieves strong cross-generator generalization. Trained exclusively on Stable Diffusion v1.4, our method achieves over 95% accuracy on all eight unseen generators in the GenImage benchmark, substantially outperforming state-of-the-art detectors. This demonstrates that context-conditional reconstruction uncertainty provides a robust, transferable signal for AIGC detection.
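To make the idea of a patch-level conditional NLL concrete, the following is a minimal illustrative sketch, not the formulation used in the paper. It assumes an isotropic Gaussian likelihood with a fixed, hypothetical standard deviation `sigma` centered on the reconstruction of each masked patch, so the NLL reduces to a scaled squared reconstruction error plus a constant. The function names `patch_nll` and `summarize_nll`, the array layout, and the choice of summary statistics are all assumptions made for illustration.

```python
import numpy as np


def patch_nll(original, reconstructed, mask, sigma=1.0):
    """Gaussian NLL of each masked patch given its reconstruction.

    Assumption (not from the paper): p(masked | visible) is modeled as an
    isotropic Gaussian centered on the reconstruction, with fixed std `sigma`.

    original, reconstructed: arrays of shape (num_patches, patch_dim)
    mask: boolean array of shape (num_patches,), True where a patch was masked
    Returns an array with one NLL value per masked patch.
    """
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)

    patch_dim = original.shape[1]
    # Squared reconstruction error, summed over the pixels of each masked patch.
    sq_err = np.sum((original[mask] - reconstructed[mask]) ** 2, axis=1)
    # Negative log-density of an isotropic Gaussian in `patch_dim` dimensions.
    return 0.5 * sq_err / sigma**2 + 0.5 * patch_dim * np.log(2.0 * np.pi * sigma**2)


def summarize_nll(nll):
    """Aggregate patch-level NLL values into simple image-level statistics.

    The statistics chosen here (mean, std, max) are illustrative only.
    """
    nll = np.asarray(nll, dtype=np.float64)
    return {"mean": float(nll.mean()), "std": float(nll.std()), "max": float(nll.max())}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(16, 48))                          # stand-in for image patches
    recon = patches + rng.normal(scale=0.1, size=patches.shape)  # stand-in for MAE output
    masked = rng.random(16) < 0.75                               # stand-in for a random mask
    print(summarize_nll(patch_nll(patches, recon, masked)))
```

In this sketch, patches that the model reconstructs poorly from their visible context receive a higher NLL; image-level statistics of this kind could then be combined with global features by a downstream classifier.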
