NS-Net: Decoupling CLIP Semantic Information through NULL-Space for Generalizable AI-Generated Image Detection

Jiazhen Yan, Fan Wang, Weiwei Jiang, Ziqiang Li, Zhangjie Fu

TL;DR

This work addresses the generalization gap in AI-generated image detection by revealing that CLIP's semantic information embedded in visual features can hinder discrimination. It introduces NS-Net, which decouples semantic content through NULL-Space projection using text-derived semantics and enhances artifact-focused detection with a Patch Selection strategy and contrastive learning. The approach yields strong cross-domain performance across 40 generative models, outperforming existing methods on GenImage, UniversalFakeDetect, and AIGIBench, and demonstrates plug-and-play applicability to other detectors. The results highlight the value of semantic disentanglement and localized artifact preservation for robust AI-generated image detection in open-world settings.

Abstract

The rapid progress of generative models, such as GANs and diffusion models, has facilitated the creation of highly realistic images, raising growing concerns over their misuse in security-sensitive domains. While existing detectors perform well under known generative settings, they often fail to generalize to unknown generative models, especially when semantic content between real and fake images is closely aligned. In this paper, we revisit the use of CLIP features for AI-generated image detection and uncover a critical limitation: the high-level semantic information embedded in CLIP's visual features hinders effective discrimination. To address this, we propose NS-Net, a novel detection framework that leverages NULL-Space projection to decouple semantic information from CLIP's visual features, followed by contrastive learning to capture intrinsic distributional differences between real and generated images. Furthermore, we design a Patch Selection strategy to preserve fine-grained artifacts by mitigating semantic bias caused by global image structures. Extensive experiments on an open-world benchmark comprising images generated by 40 diverse generative models show that NS-Net outperforms existing state-of-the-art methods, achieving a 7.4% improvement in detection accuracy, thereby demonstrating strong generalization across both GAN- and diffusion-based image generation techniques.
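The NULL-Space projection idea from the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the semantic directions are given as a handful of text-embedding vectors, and it removes their span from a visual feature by Gram-Schmidt orthogonalization. The function names (`orthonormal_basis`, `project_to_null_space`) and the toy 4-dimensional vectors are hypothetical; real CLIP features are much higher-dimensional, and the paper may construct the semantic subspace differently.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already contained in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(feature, semantic_vectors):
    """Remove every component of `feature` lying in the semantic span.

    The result is orthogonal to all semantic vectors, i.e. it lies in the
    null space of the matrix whose rows are those vectors.
    """
    out = list(feature)
    for q in orthonormal_basis(semantic_vectors):
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out


# Toy example (assumed data): two "text" directions spanning the first two axes.
semantic = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
feature = [0.5, -2.0, 3.0, 4.0]
residual = project_to_null_space(feature, semantic)
```

In this toy case the residual keeps only the last two coordinates of the feature, which are exactly the components that carry no information along the assumed semantic directions.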

Paper Structure

This paper contains 18 sections, 12 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: t-SNE visualization of features extracted from the matched dataset and the mismatched dataset.
  • Figure 2: Architecture of NS-Net for generalizable AI-generated image detection. We first apply the Patch Selection strategy, adjusted for CLIP's input size, to preserve potential forgery-related artifacts. The visual features extracted by CLIP's image encoder are then projected onto the NULL-Space of the semantic information, which removes task-irrelevant semantic components. The resulting detection-oriented features are used in a contrastive learning framework that both guides the linear classification layer and captures the intrinsic distributional differences between real and AI-generated images, improving generalization beyond simple classification.
  • Figure 3: t-SNE visualization of features extracted before the classifier, comparing VIB-Net with our NS-Net. Four test generators are considered, spanning GANs and diffusion models: SDXL, FLUX, R3GAN, and Guided.
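
The contrastive learning stage described in the Figure 2 caption can be sketched as follows. The exact loss used by NS-Net is not stated in this summary, so the example assumes a generic supervised contrastive (SupCon-style) objective over real/fake labels; the function name, the temperature value, and the toy 2-dimensional features are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """SupCon-style loss: pull same-label features together, push others apart.

    `features` is a list of vectors, `labels` a list of class ids
    (for example 0 = real, 1 = generated). Assumed formulation, not the paper's.
    """
    z = [normalize(f) for f in features]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        m = max(logits.values())  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy features (assumed data): two tight clusters in 2-D.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = supervised_contrastive_loss(feats, [0, 0, 1, 1])
loss_mixed = supervised_contrastive_loss(feats, [0, 1, 0, 1])
```

When the labels agree with the clusters the loss is small, and when each label is split across both clusters it is much larger, which is the behavior that encourages real and generated images to occupy separate regions of feature space.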