Multimodal Conditional Information Bottleneck for Generalizable AI-Generated Image Detection
Haotian Qin, Dongliang Chang, Yueying Gao, Bingyao Yu, Lei Chen, Zhanyu Ma
TL;DR
This work tackles the generalization gap in CLIP-based AI-generated image detectors caused by feature redundancy. It introduces InfoFD, a multimodal conditional information bottleneck framework that conditions latent representations on text and class information. InfoFD uses a Text-Guided Conditional Information Bottleneck (TGCIB) and Dynamic Text Orthogonalization (DTO) to reduce redundancy and exploit text signals. The method employs a variational approximation with Gram-Schmidt orthogonalization and a training objective that combines a maximum-mean-discrepancy term with a classification loss, alongside Composite Gaussian Perturbation (CGP) to simulate multi-domain variations. Empirical results on GenImage and CO-SPY demonstrate strong cross-model generalization without CLIP fine-tuning, and ablations confirm the benefit of CGP and of the text-conditioned approach. The work also highlights a textual bias in CLIP features that can be harnessed for improved detection.
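The Gram-Schmidt step mentioned above can be illustrated with a minimal sketch. This is the classical procedure applied to toy vectors, not the authors' DTO implementation: the function names `dot` and `gram_schmidt`, the tolerance `eps`, and the three-dimensional "text features" are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors, eps=1e-10):
    """Return an orthonormal basis for the span of `vectors`.

    Uses the modified Gram-Schmidt update (projections are removed
    one basis vector at a time), which is numerically steadier than
    the textbook form.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        # Skip vectors that are (numerically) linearly dependent
        # on the ones already in the basis.
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis


# Hypothetical toy "text features"; the third is the sum of the first two.
text_features = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 1.0],
]
basis = gram_schmidt(text_features)
```

Because the third toy vector lies in the span of the first two, it contributes no new direction and is dropped. In a real pipeline the inputs would presumably be CLIP text embeddings, and how DTO weights and updates them is specified in the paper rather than here.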
Abstract
Although existing CLIP-based methods for detecting AI-generated images have achieved promising results, they are still limited by severe feature redundancy, which hinders their generalization ability. A straightforward remedy is to incorporate an information bottleneck network into the task. However, relying solely on image-corresponding prompts results in suboptimal performance due to the inherent diversity of prompts. In this paper, we propose a multimodal conditional information bottleneck network that reduces feature redundancy while enhancing the discriminative power of features extracted by CLIP, thereby improving the model's generalization ability. We begin with a semantic analysis experiment, in which we observe that arbitrary text features exhibit lower cosine similarity with real image features than with fake image features in the CLIP feature space, a phenomenon we refer to as "bias". Building on this observation, we introduce InfoFD, a text-guided AI-generated image detection framework. InfoFD consists of two key components: the Text-Guided Conditional Information Bottleneck (TGCIB) and Dynamic Text Orthogonalization (DTO). TGCIB improves the generalizability of learned representations by conditioning on both text and class modalities. DTO dynamically updates weighted text features, preserving semantic information while leveraging the global "bias". Our model achieves exceptional generalization performance on the GenImage dataset and on the latest generative models. Our code is available at https://github.com/Ant0ny44/InfoFD.
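The "bias" described in the abstract is a gap between two mean cosine similarities, and the measurement can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `cosine_similarity`, `mean_similarity`, and `similarity_gap` are hypothetical, and the two-dimensional vectors are hand-picked placeholders rather than CLIP features or results from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def mean_similarity(text_feats, image_feats):
    """Average cosine similarity over all (text, image) pairs."""
    sims = [cosine_similarity(t, x) for t in text_feats for x in image_feats]
    return sum(sims) / len(sims)


def similarity_gap(text_feats, real_feats, fake_feats):
    """Mean text-to-fake similarity minus mean text-to-real similarity.

    A positive value matches the direction of the "bias" reported in
    the abstract (text is closer to fake images than to real ones).
    """
    return (mean_similarity(text_feats, fake_feats)
            - mean_similarity(text_feats, real_feats))


# Placeholder vectors chosen by hand so the gap is easy to verify;
# they stand in for embeddings and carry no empirical meaning.
text_feats = [[1.0, 0.0], [0.8, 0.6]]
real_feats = [[0.0, 1.0]]
fake_feats = [[1.0, 0.0]]

gap = similarity_gap(text_feats, real_feats, fake_feats)
```

To reproduce the actual analysis, the placeholder lists would be replaced with text and image embeddings from a frozen CLIP encoder; the sign and size of the gap on real data are reported in the paper, not by this sketch.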
