$\bf{D^3}$QE: Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection

Yanran Zhang, Bingyao Yu, Yu Zheng, Wenzhao Zheng, Yueqi Duan, Lei Chen, Jie Zhou, Jiwen Lu

TL;DR

The paper tackles the rising challenge of detecting autoregressive-generated images by leveraging the distinctive discrete latent-space patterns of AR models. It introduces D$^3$QE, a pipeline that combines quantization-error features with a Discrete Distribution Discrepancy-Aware Transformer (D$^3$AT) and CLIP-based semantic embeddings to discriminate real from AR-generated images. A new ARForensics dataset with 7 AR models and balanced real/generated samples enables robust evaluation, where D$^3$QE outperforms state-of-the-art baselines and shows strong cross-paradigm generalization as well as resilience to perturbations. The approach offers a principled way to exploit codebook frequency statistics and quantization residuals for forensic detection, with practical impact for safeguarding authenticity in digital media.

Abstract

The emergence of visual autoregressive (AR) models has revolutionized image generation while presenting new challenges for synthetic image detection. Unlike previous GAN or diffusion-based methods, AR models generate images through discrete token prediction, exhibiting both marked improvements in image synthesis quality and unique characteristics in their vector-quantized representations. In this paper, we propose to leverage Discrete Distribution Discrepancy-aware Quantization Error (D$^3$QE) for autoregressive-generated image detection, exploiting the distinctive patterns and the codebook frequency distribution bias that distinguish real from fake images. We introduce a discrete distribution discrepancy-aware transformer that integrates dynamic codebook frequency statistics into its attention mechanism, fusing semantic features with quantization-error latents. To evaluate our method, we construct a comprehensive dataset termed ARForensics covering 7 mainstream visual AR models. Experiments demonstrate superior detection accuracy and strong generalization of D$^3$QE across different AR models, with robustness to real-world perturbations. Code is available at https://github.com/Zhangyr2022/D3QE.
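To make the notion of quantization error concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy 2-D codebook and hand-picked latent vectors (both hypothetical), and uses nearest-neighbour lookup in Euclidean distance, which is the usual vector-quantization rule. The residual between a continuous latent and its assigned code is the kind of signal the abstract calls quantization error.

```python
import math

def nearest_code(z, codebook):
    """Return (index, vector) of the codebook entry closest to z (Euclidean)."""
    best = min(range(len(codebook)), key=lambda k: math.dist(z, codebook[k]))
    return best, codebook[best]

def quantization_error(latents, codebook):
    """For each latent vector z, return its token index and the residual z - q(z)."""
    indices, residuals = [], []
    for z in latents:
        k, q = nearest_code(z, codebook)
        indices.append(k)
        residuals.append([zi - qi for zi, qi in zip(z, q)])
    return indices, residuals

# Toy values, chosen only for illustration (assumption, not from the paper).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latents = [[0.9, 0.1], [0.1, 0.8], [0.0, 0.0]]

indices, residuals = quantization_error(latents, codebook)
print(indices)    # discrete token ids assigned to each latent
print(residuals)  # pre- minus post-quantization features
```

In the paper's pipeline the latents would come from a frozen VQVAE encoder rather than a hand-written list; the sketch only shows how token indices and residuals relate.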

Paper Structure

This paper contains 16 sections, 7 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Visualization of Discrete Distribution Discrepancy. To elucidate the mechanism of D$^3$QE, we analyze token probability distributions from LlamaGen's tokenizer using autoregressive sampling. (a) shows the full codebook vector probability distribution, while (b) displays the top-500 activation probabilities. The real data exhibits pronounced long-tail characteristics, whereas generated samples demonstrate concentrated probability mass in the peak regions, which D$^3$QE leverages for detection.
  • Figure 2: D$^3$QE pipeline. Our approach first extracts quantized representations through a VQVAE encoder, computes the discrete distribution discrepancy between pre- and post-quantization features, and obtains discrete features via the D$^3$AT module. Semantic features are extracted using CLIP in parallel. The feature alignment module processes global semantic features, which then fuse with local discrete features for binary classification between generated and real samples. Blue snowflake symbols indicate frozen parameters, while red flame symbols denote trainable modules.
  • Figure 3: Illustration of the D$^3$ASA module in Equation \ref{eq:D3ASA}, which incorporates distribution discrepancy information into the attention mechanism.
  • Figure 4: Robustness Analysis. Performance comparison under image cropping and JPEG compression. Our method maintains superior accuracy across different perturbation levels, demonstrating strong robustness against common image transformations.
  • Figure 5: Visualization of codebook activation patterns. Heatmaps show normalized logarithmic activation frequencies of VQVAE codebook vectors for (a) real samples and (b) generated samples, with (c) their log-ratio difference. Real samples exhibit uniform activation patterns, while generated samples show significant polarization in high-frequency regions.
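The codebook statistics described in Figures 1 and 5 can be illustrated with a small sketch. It assumes token indices have already been obtained from a tokenizer; the token lists, the codebook size `K`, and the additive smoothing constant `eps` are all hypothetical choices made for this example and are not taken from the paper.

```python
import math
from collections import Counter

def codebook_frequencies(token_ids, codebook_size, eps=1.0):
    """Smoothed activation frequency of every codebook entry (sums to 1)."""
    counts = Counter(token_ids)
    total = len(token_ids) + eps * codebook_size
    return [(counts.get(k, 0) + eps) / total for k in range(codebook_size)]

def log_ratio(freq_real, freq_fake):
    """Per-entry log(fake / real); positive means over-used in generated data."""
    return [math.log(f / r) for r, f in zip(freq_real, freq_fake)]

# Toy token streams (assumption): real tokens spread evenly over the codebook,
# generated tokens concentrate on a few entries.
K = 4
real_tokens = [0, 1, 2, 3, 0, 1, 2, 3]
fake_tokens = [0, 0, 0, 0, 0, 1, 1, 2]

p_real = codebook_frequencies(real_tokens, K)
p_fake = codebook_frequencies(fake_tokens, K)
ratio = log_ratio(p_real, p_fake)
print(p_real, p_fake, ratio)
```

The smoothing keeps the log-ratio finite for entries that never fire in one of the two sets. Entries with a strongly positive or negative ratio are the ones where real and generated activation patterns diverge most.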