ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis

Zanlin Ni, Yulin Wang, Renping Zhou, Yizeng Han, Jiayi Guo, Zhiyuan Liu, Yuan Yao, Gao Huang

TL;DR

This paper proposes EfficientNAT (ENAT), a non-autoregressive Transformer (NAT) that explicitly encourages the critical spatial and temporal interactions inherent in NATs, notably improving their performance at significantly reduced computational cost.

Abstract

Recently, token-based generation has demonstrated its effectiveness in image synthesis. As a representative example, non-autoregressive Transformers (NATs) can generate decent-quality images in a few steps. NATs perform generation progressively: the latent tokens of the resulting image are revealed incrementally, and at each step the unrevealed image regions are padded with mask tokens and inferred by the NAT. In this paper, we delve into the mechanisms behind the effectiveness of NATs and uncover two important patterns that naturally emerge from them. Spatially (within a step), although mask and visible tokens are processed uniformly by NATs, the interactions between them are highly asymmetric. Specifically, mask tokens mainly gather information for decoding, while visible tokens primarily provide information, and their deep representations can be built upon themselves alone. Temporally (across steps), the interactions between adjacent generation steps mostly concentrate on updating the representations of a few critical tokens, while the computation for the majority of tokens is generally repetitive. Driven by these findings, we propose EfficientNAT (ENAT), a NAT model that explicitly encourages these critical interactions. At the spatial level, we disentangle the computations of visible and mask tokens by encoding visible tokens independently, while decoding mask tokens conditioned on the fully encoded visible tokens. At the temporal level, we prioritize the computation of the critical tokens at each step, while maximally reusing previously computed token representations to supplement necessary information. ENAT notably improves the performance of NATs at significantly reduced computational cost. Experiments on ImageNet-256, ImageNet-512 and MS-COCO validate the effectiveness of ENAT. Code is available at https://github.com/LeapLabTHU/ENAT.
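
The progressive generation described in the abstract can be illustrated with a toy loop. This is only a minimal sketch, not the ENAT implementation: `MASK`, `toy_predictor`, and `progressive_decode` are hypothetical names, the predictor returns random guesses in place of a real Transformer, and both the linear reveal schedule and the confidence-based selection of tokens to reveal are assumptions made for illustration.

```python
import random

MASK = -1  # hypothetical id standing in for the [MASK] token


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the NAT: one (token_id, confidence) guess per masked position."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def progressive_decode(num_tokens=16, num_steps=4, vocab_size=1024, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens  # generation starts from a fully masked canvas
    history = []                  # positions newly revealed at each step
    for step in range(1, num_steps + 1):
        guesses = toy_predictor(tokens, vocab_size, rng)
        # Assumed linear schedule: after step s, about s/num_steps of the tokens are visible.
        target_visible = round(num_tokens * step / num_steps)
        num_to_reveal = target_visible - (num_tokens - len(guesses))
        # Reveal the most confident guesses; the rest stay masked for later steps.
        ranked = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        newly_decoded = ranked[:num_to_reveal]
        for i in newly_decoded:
            tokens[i] = guesses[i][0]
        history.append(sorted(newly_decoded))
    return tokens, history


if __name__ == "__main__":
    final_tokens, revealed_per_step = progressive_decode()
    for step, positions in enumerate(revealed_per_step, start=1):
        print(f"step {step}: newly decoded positions {positions}")
```

In this sketch, the positions recorded in `history` play the role of the newly decoded tokens at each step, while already visible tokens are left untouched between steps.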

Paper Structure

This paper contains 34 sections, 5 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: The generation process of NATs starts from a masked canvas and decodes multiple tokens per step; the decoded tokens are then mapped to the pixel space using a pre-trained VQ-decoder (Esser et al., 2021).
  • Figure 2: An ablation study on four types of spatial interactions. The essential spatial interaction is the [M] to [V] attention. In contrast, the [V] to [M] attention only marginally affects the model.
  • Figure 3: (a) Existing NATs process visible and [MASK] tokens equivalently. (b) Our disentangled architecture independently encodes visible tokens and integrates their fully contextualized features into the [MASK] token decoding process. $\bm{M}$ is the indicator of [MASK] tokens while $\bm{\bar{M}}$ is the indicator of visible tokens. The SC-Attention concatenates the visible and mask token features to produce keys and values, providing a complete context for the mask token decoding.
  • Figure 4: Overview of ENAT. Based on the disentangled architecture in Fig. 3b, we further propose to only encode the critical (i.e., newly decoded) tokens and maximally reuse previously extracted features to supplement necessary information. $\bm{\Delta}$ is the indicator of newly decoded tokens. Only one transformer block is illustrated for simplicity.
  • Figure 5: Feature similarity analysis. (a) We randomly choose two samples and visualize the token-to-token feature similarity between adjacent steps (2 & 3 and 6 & 7), with the positions of newly decoded tokens visualized on the right. (b) The token feature similarity averaged over 50,000 generated samples in each pair of adjacent steps ($t=1\!\rightarrow\!2$, $t=2\!\rightarrow\!3$, $\ldots$, $t=7\!\rightarrow\!8$).
  • ...and 4 more figures
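
The SC-Attention described in the Figure 3 caption, where keys and values are built from the concatenated visible and mask token features, can be sketched roughly as below. This is an illustrative single-head version under stated assumptions: the function name `sc_attention` is hypothetical, queries are assumed to come from the mask tokens only, and the learned query/key/value projections of a real Transformer block are replaced by identity maps.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sc_attention(mask_feats, visible_feats):
    """Toy single-head attention for mask tokens.

    Queries: mask token features (assumption).
    Keys/values: visible and mask features concatenated along the token axis.
    Projections are identity maps here (assumption, for brevity).
    """
    context = visible_feats + mask_feats  # concatenation of token lists
    dim = len(mask_feats[0])
    outputs = []
    for query in mask_feats:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in context
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[c] for w, value in zip(weights, context)) for c in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    visible = [[1.0, 0.0], [0.0, 1.0]]  # two already-encoded visible tokens
    masked = [[0.5, 0.5]]               # one mask token to decode
    print(sc_attention(masked, visible))
```

Each mask token thus attends over the full set of tokens, while the visible token features are only read and never updated by this operation, matching the asymmetry the caption describes.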