ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process

Changyao Tian, Chenxin Tao, Jifeng Dai, Hao Li, Ziheng Li, Lewei Lu, Xiaogang Wang, Hongsheng Li, Gao Huang, Xizhou Zhu

TL;DR

ADDP tackles the gap between image recognition and generation by learning general representations through an alternating denoising diffusion process that operates across raw pixels and VQ tokens. By decoding pixels from VQ tokens and then generating new VQ tokens from those pixels, ADDP jointly trains a pixel-to-token generator and a VQ-based decoder within an ELBO objective, enabling strong performance in both unconditional image generation and dense recognition tasks. Empirical results show competitive generation quality and transferability to ImageNet classification, COCO detection, and ADE20k segmentation, with ablations highlighting the importance of conditioning on reliable tokens and using the token-predictor target $q(\bar{z}_{t-1}|z_t)$. Overall, ADDP demonstrates the viability of general representations that support both synthesis and dense perception, with code released for reproducibility and potential extension to higher resolutions and continuous diffusion.

Abstract

Image recognition and generation have long been developed independently of each other. The recent trend towards general-purpose representation learning has also encouraged the development of general representations for both recognition and generation tasks. However, preliminary attempts mainly focus on generation performance and remain inferior on recognition tasks. These methods are modeled in the vector-quantized (VQ) space, whereas leading recognition methods use pixels as inputs. Our key insights are twofold: (1) pixels as inputs are crucial for recognition tasks; (2) VQ tokens as reconstruction targets are beneficial for generation tasks. These observations motivate us to propose an Alternating Denoising Diffusion Process (ADDP) that integrates these two spaces within a single representation learning framework. In each denoising step, our method first decodes pixels from previous VQ tokens, then generates new VQ tokens from the decoded pixels. The diffusion process gradually masks out a portion of VQ tokens to construct the training samples. The learned representations can be used to generate diverse high-fidelity images and also demonstrate excellent transfer performance on recognition tasks. Extensive experiments show that our method achieves competitive performance on unconditional generation, ImageNet classification, COCO detection, and ADE20k segmentation. Importantly, our method represents the first successful development of general representations applicable to both generation and dense recognition tasks. Code is released at https://github.com/ChangyaoTian/ADDP.
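
The abstract states that the diffusion process gradually masks out a portion of VQ tokens to construct training samples. The following is a minimal sketch of that idea only, not the authors' implementation: the sentinel `MASK`, the function name `forward_process`, the linear masking schedule, and the fixed random masking order are all assumptions made for illustration, and the split into reliable and unreliable tokens used by ADDP is omitted.

```python
import random

MASK = -1  # hypothetical sentinel id standing in for a masked VQ token


def forward_process(z0, num_steps, seed=0):
    """Return [z_0, ..., z_T], where each state masks more tokens than the last.

    z0 is a list of VQ token ids. A linear schedule is assumed: at step t,
    round(len(z0) * t / num_steps) tokens are masked, so z_T is fully masked.
    """
    rng = random.Random(seed)
    order = list(range(len(z0)))
    rng.shuffle(order)  # fixed order, so the masked sets are nested across steps
    states = []
    for t in range(num_steps + 1):
        n_masked = round(len(z0) * t / num_steps)
        masked = set(order[:n_masked])
        states.append([MASK if i in masked else tok for i, tok in enumerate(z0)])
    return states


if __name__ == "__main__":
    for t, z_t in enumerate(forward_process([5, 3, 8, 1], num_steps=4)):
        print(t, z_t)
```

Sampling a timestep `t` and taking `states[t]` then gives one partially masked training sample under these assumptions.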

Paper Structure

This paper contains 27 sections, 17 equations, 16 figures, 19 tables, 2 algorithms.

Figures (16)

  • Figure 1: Inference pipelines of unified methods that learn general representations for both generation and recognition. Previous methods are modeled either entirely in raw-pixel space (iGPT [chen2020generative]) or entirely in VQ-token space (ViT-VQGAN [yu2021vectorquantized] and MAGE [li2022mage]). In contrast, ADDP exploits both spaces, yielding competitive performance on both recognition and generation tasks.
  • Figure 2: Alternating denoising process. Our method first predicts $p_{\theta}(z_T, \bar{z}_T|\varnothing)$ by directly feeding all mask tokens into our decoder $D$ in Eq. (\ref{equ:p2t}). At each step $t$, the noisy image $x_t$ is decoded according to Eq. (\ref{equ:t2p}), then used to generate new reliable tokens $z_{t-1}$ and unreliable tokens $\bar{z}_{t-1}$ according to Eq. (\ref{equ:p2t}). $x_0$ is the final synthesized noise-free image.
  • Figure 3: Diffusion process.
  • Figure 4: Training pipeline of ADDP. The original training image $x_0$ is first encoded into VQ tokens $z_0$, then a timestep $t$ is sampled. The reliable and unreliable tokens $z_t$ and $\bar{z}_t$ are generated according to the diffusion process in Sec. \ref{subsec:diffusion_process}. After that, $x_t$ is decoded by token-to-pixel decoding in Sec. \ref{subsec:denoising_process}. Our pixel-to-token generation network takes $x_t$ as input and generates the prediction of $\bar{z}_{t-1}$. $q(\bar{z}_{t-1}|z_{t})$ is used as the training target, as mentioned in Sec. \ref{subsec:learning_process}. The lock symbol means that these networks are frozen during training.
  • Figure 5: Inference pipeline of ADDP for image generation and recognition.
  • ...and 11 more figures
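
The alternating denoising loop described in the Figure 2 caption can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the released ADDP code: `decode_pixels` and `predict_tokens` are hypothetical callables standing in for token-to-pixel decoding and pixel-to-token generation, `MASK` is a made-up sentinel, the number of reliable tokens grows on an assumed linear schedule, and the tokens promoted to reliable are picked at random here, whereas an actual sampler would more plausibly use the model's predictions to choose them.

```python
import random

MASK = -1  # hypothetical sentinel id for a position with no reliable token yet


def alternating_denoise(decode_pixels, predict_tokens, num_tokens, num_steps, seed=0):
    """Toy alternating loop: decode pixels from tokens, then predict new tokens.

    decode_pixels(reliable, unreliable) -> x_t      (token-to-pixel, assumed API)
    predict_tokens(x_t or None) -> list of tokens   (pixel-to-token, assumed API)
    """
    rng = random.Random(seed)
    reliable = [MASK] * num_tokens      # z_T: nothing is trusted at the start
    unreliable = predict_tokens(None)   # z_bar_T, predicted from an all-mask input
    for t in range(num_steps, 0, -1):
        x_t = decode_pixels(reliable, unreliable)   # token-to-pixel decoding
        predicted = predict_tokens(x_t)             # pixel-to-token generation
        # Assumed linear schedule for how many tokens are reliable after this step.
        n_target = round(num_tokens * (num_steps - t + 1) / num_steps)
        open_slots = [i for i, tok in enumerate(reliable) if tok == MASK]
        n_new = n_target - (num_tokens - len(open_slots))
        for i in rng.sample(open_slots, n_new):     # random choice is an assumption
            reliable[i] = predicted[i]
        unreliable = predicted                      # the rest stays unreliable
    return decode_pixels(reliable, unreliable)      # x_0, decoded from final tokens


if __name__ == "__main__":
    target = [4, 7, 1, 9, 2, 6]
    # Stub "networks": pixels are just the merged token list, and the
    # predictor always returns the same fixed tokens.
    decode = lambda z, z_bar: [a if a != MASK else b for a, b in zip(z, z_bar)]
    predict = lambda x: list(target)
    print(alternating_denoise(decode, predict, num_tokens=6, num_steps=3))
```

With the stub functions in the usage block, every position ends up holding the predictor's token, so the loop returns the fixed target list; real networks would replace both stubs.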