TranX-Adapter: Bridging Artifacts and Semantics within MLLMs for Robust AI-generated Image Detection

Wenbin Wang, Yuge Huang, Jianqing Xu, Yue Yu, Jiangtao Yan, Shouhong Ding, Pan Zhou, Yong Luo

TL;DR

A lightweight fusion adapter, TranX-Adapter, is proposed, which integrates a Task-aware Optimal-Transport Fusion that leverages the Jensen-Shannon divergence between artifact and semantic prediction probabilities as a cost matrix to transfer artifact information into semantic features, and an X-Fusion that employs cross-attention to transfer semantic information into artifact features.

Abstract

Rapid advances in AI-generated image (AIGI) technology enable highly realistic synthesis, threatening public information integrity and security. Recent studies have demonstrated that incorporating texture-level artifact features alongside semantic features into multimodal large language models (MLLMs) can enhance their AIGI detection capability. However, our preliminary analyses reveal that artifact features exhibit high intra-feature similarity, leading to an almost uniform attention map after the softmax operation. This phenomenon causes attention dilution, which hinders effective fusion between semantic and artifact features. To overcome this limitation, we propose a lightweight fusion adapter, TranX-Adapter, which integrates two modules: a Task-aware Optimal-Transport Fusion that uses the Jensen-Shannon divergence between artifact and semantic prediction probabilities as a cost matrix to transfer artifact information into semantic features, and an X-Fusion that employs cross-attention to transfer semantic information into artifact features. Experiments on standard AIGI detection benchmarks with several advanced MLLMs show that TranX-Adapter brings consistent and significant improvements (up to +6% accuracy).
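The attention dilution effect described in the abstract can be illustrated with a small numerical sketch. This is not the authors' analysis code: the token counts, feature dimension, noise scale, and the helper names `attention_map` and `normalized_entropy` are all assumptions chosen for illustration, and entropy divided by log n is only one plausible way to measure how close an attention row is to uniform.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(queries, keys):
    # Scaled dot-product attention weights, one row per query token.
    d = queries.shape[-1]
    return softmax(queries @ keys.T / np.sqrt(d), axis=-1)

def normalized_entropy(attn, eps=1e-12):
    # Mean row entropy divided by log(n); 1.0 means exactly uniform rows.
    n = attn.shape[-1]
    h = -(attn * np.log(attn + eps)).sum(axis=-1)
    return float((h / np.log(n)).mean())

rng = np.random.default_rng(0)
n_tokens, dim = 64, 32  # illustrative sizes, not taken from the paper
queries = rng.normal(size=(n_tokens, dim))

# Keys that differ from token to token (stand-in for semantic features).
diverse_keys = rng.normal(size=(n_tokens, dim))

# Keys that are nearly identical to each other (stand-in for artifact
# features with high intra-feature similarity).
shared = rng.normal(size=(1, dim))
similar_keys = shared + 0.01 * rng.normal(size=(n_tokens, dim))

attn_diverse = attention_map(queries, diverse_keys)
attn_similar = attention_map(queries, similar_keys)
h_diverse = normalized_entropy(attn_diverse)
h_similar = normalized_entropy(attn_similar)

print(f"normalized entropy, diverse keys: {h_diverse:.4f}")
print(f"normalized entropy, similar keys: {h_similar:.4f}")
```

When the keys are nearly identical, every query sees almost the same logit for every key, so the softmax output is close to uniform and the attention carries little token-specific information.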

Paper Structure

This paper contains 18 sections, 5 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Comparison between the previous fusion method and our TranX-Adapter. (a) Previous Method: Concatenates artifact (e.g., from NPR) and semantic features (e.g., from CLIP-ViT), resulting in uniform attention and weak interaction. (b) Our TranX-Adapter: Incorporates a lightweight bidirectional fusion mechanism that enhances feature interaction.
  • Figure 2: Comparison between the processed NPR input (a) and the CLIP-ViT input image (b), where NPR highlights local pixel interdependencies for synthetic image detection.
  • Figure 3: Distributional comparison of representational variances between CLIP-ViT and NPR: (a) variance of L2 norms and (b) variance of cosine similarities across image patches.
  • Figure 4: Comparison of cross-encoder interactions between CLIP-ViT and NPR. (a) Relative entropy of attention maps; values approaching $1$ indicate a distribution closer to uniform. (b) Attention map for the direction $\mathrm{CLIP}\rightarrow\mathrm{NPR}$, where the queries come from NPR features and the keys and values from CLIP-ViT features. (c) Attention map for the direction $\mathrm{NPR}\rightarrow\mathrm{CLIP}$, where the queries come from CLIP-ViT features and the keys and values from NPR features. (d) Relationship between the training loss and the information flow metric $S$.
  • Figure 5: Overview of the proposed TranX-Adapter, which consists of two complementary fusion modules. The Task-Aware Optimal-Transport Fusion (TOP-Fusion) aligns artifact and semantic feature predictions by computing the JS divergence between their logits and transferring artifact features into the semantic features through optimal transport, yielding an enhanced semantic feature $\hat{F}_{sem}$. The X-Fusion module transfers semantic features into artifact features via multi-layer cross-attention, producing $\hat{F}_{art}$. The fused representations are finally fed into the Large Language Model (LLM) for detection.
  • ...and 1 more figure
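The TOP-Fusion idea in the Figure 5 caption (a JS-divergence cost followed by optimal transport from artifact to semantic features) might be sketched roughly as below. Everything beyond that high-level idea is an assumption made for illustration: the per-token linear heads `W_sem` and `W_art`, the uniform marginals, the entropic Sinkhorn solver with `reg=0.2`, and the residual update of the semantic features are hypothetical choices, not the paper's stated implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def js_divergence_matrix(p, q, eps=1e-12):
    # Pairwise Jensen-Shannon divergence between rows of p (n, c) and q (m, c).
    p = p[:, None, :]
    q = q[None, :, :]
    m = 0.5 * (p + q)
    kl_pm = (p * (np.log(p + eps) - np.log(m + eps))).sum(axis=-1)
    kl_qm = (q * (np.log(q + eps) - np.log(m + eps))).sum(axis=-1)
    return 0.5 * (kl_pm + kl_qm)

def sinkhorn(cost, reg=0.2, n_iters=300):
    # Entropy-regularized optimal transport with uniform marginals (assumed).
    n, m = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
n_sem, n_art, dim = 6, 5, 8  # illustrative sizes
F_sem = rng.normal(size=(n_sem, dim))  # stand-in semantic tokens
F_art = rng.normal(size=(n_art, dim))  # stand-in artifact tokens

# Hypothetical per-token real/fake classification heads.
W_sem = rng.normal(size=(dim, 2))
W_art = rng.normal(size=(dim, 2))
p_sem = softmax(F_sem @ W_sem)
p_art = softmax(F_art @ W_art)

# Cost is low where a semantic token and an artifact token agree on the task.
cost = js_divergence_matrix(p_sem, p_art)
plan = sinkhorn(cost)

# Move artifact information into each semantic token (assumed residual form).
weights = plan / plan.sum(axis=1, keepdims=True)
F_sem_hat = F_sem + weights @ F_art
print(F_sem_hat.shape)
```

Unlike a softmax over raw feature similarities, the transport plan here is driven by agreement between task predictions and is constrained by its marginals, so it need not collapse to a uniform map just because the artifact features resemble one another.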