Fooling the Watchers: Breaking AIGC Detectors via Semantic Prompt Attacks
Run Hao, Peng Ying
TL;DR
This work addresses the vulnerability of AIGC detectors to semantic prompt attacks in text-to-image portrait generation. It introduces a grammar-tree-based prompt generator combined with UCT-RAND, a variant of Monte Carlo Tree Search, to automatically and efficiently explore semantically rich prompts whose generated images evade detectors across multiple generation models. The approach evades both open-source and commercial detectors and ranked first in a real-world adversarial AIGC detection competition, while also enabling the creation of diverse adversarial datasets for robustness training. The study highlights detector fragility to semantic manipulation and offers a practical framework for producing challenging evaluation data and guiding more robust detection and defense strategies.
Abstract
The rise of text-to-image (T2I) models has enabled the synthesis of photorealistic human portraits, raising serious concerns about identity misuse and the robustness of AIGC detectors. In this work, we propose an automated adversarial prompt generation framework that leverages a grammar tree structure and a variant of the Monte Carlo tree search algorithm to systematically explore the semantic prompt space. Our method generates diverse, controllable prompts that consistently evade both open-source and commercial AIGC detectors. Extensive experiments across multiple T2I models validate its effectiveness, and the approach ranked first in a real-world adversarial AIGC detection competition. Beyond attack scenarios, our method can also be used to construct high-quality adversarial datasets, providing valuable resources for training and evaluating more robust AIGC detection and defense systems.
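To make the idea of searching a grammar-structured prompt space concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It uses a per-rule UCB1 bandit, the selection rule underlying standard UCT, rather than the UCT-RAND variant, whose details are not given here. The toy grammar, the names `RuleStats`, `generate`, `search`, and `toy_score` are all hypothetical, and the scoring function is a stand-in for the real pipeline of rendering a prompt with a T2I model and querying a detector.

```python
import math

# Hypothetical toy grammar: each nonterminal maps to a list of alternatives,
# and each alternative is a sequence of terminals and nonterminals.
GRAMMAR = {
    "PROMPT": [["portrait of", "SUBJECT", ",", "LIGHTING", ",", "STYLE"]],
    "SUBJECT": [["an elderly fisherman"], ["a young violinist"], ["a nurse"]],
    "LIGHTING": [["soft window light"], ["harsh noon sun"]],
    "STYLE": [["35mm film grain"], ["studio headshot"]],
}


class RuleStats:
    """Visit counts and summed rewards for the alternatives of one nonterminal."""

    def __init__(self, n_alternatives):
        self.visits = [0] * n_alternatives
        self.reward = [0.0] * n_alternatives

    def select(self, c=1.4):
        # Try every alternative once before applying UCB1.
        for i, v in enumerate(self.visits):
            if v == 0:
                return i
        total = sum(self.visits)

        def ucb(i):
            mean = self.reward[i] / self.visits[i]
            return mean + c * math.sqrt(math.log(total) / self.visits[i])

        return max(range(len(self.visits)), key=ucb)


def generate(stats, symbol="PROMPT", trace=None):
    """Expand `symbol` top-down, recording each (nonterminal, choice) made."""
    trace = [] if trace is None else trace
    if symbol not in GRAMMAR:  # terminal: emit as-is
        return [symbol], trace
    choice = stats[symbol].select()
    trace.append((symbol, choice))
    tokens = []
    for child in GRAMMAR[symbol][choice]:
        child_tokens, _ = generate(stats, child, trace)
        tokens.extend(child_tokens)
    return tokens, trace


def search(evasion_score, iterations=200):
    """Sample prompts, score them, and back up the score to every rule used."""
    stats = {nt: RuleStats(len(alts)) for nt, alts in GRAMMAR.items()}
    best_prompt, best_score = None, float("-inf")
    for _ in range(iterations):
        tokens, trace = generate(stats)
        prompt = " ".join(tokens).replace(" ,", ",")
        score = evasion_score(prompt)
        for symbol, choice in trace:
            stats[symbol].visits[choice] += 1
            stats[symbol].reward[choice] += score
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score


def toy_score(prompt):
    # Placeholder for an evasion signal such as 1 minus the detector's
    # "AI-generated" probability; purely illustrative, no real detector involved.
    return 1.0 if "film grain" in prompt else 0.0
```

Calling `search(toy_score, iterations=50)` returns the highest-scoring prompt found and its score. In a real setting the scoring call would dominate the cost, since each evaluation requires an image generation and a detector query, which is why a guided search is preferable to exhaustive enumeration of the grammar.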
