AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation

Zanlin Ni, Yulin Wang, Renping Zhou, Rui Lu, Jiayi Guo, Jinyi Hu, Zhiyuan Liu, Yuan Yao, Gao Huang

TL;DR

AdaNAT is a learnable approach that automatically configures a generation policy tailored to each sample to be generated. The work also shows that simple reward designs, such as FID or pre-trained reward models, may not reliably guarantee the desired quality or diversity of generated samples.

Abstract

Recent studies have demonstrated the effectiveness of token-based methods for visual content generation. As a representative example, non-autoregressive Transformers (NATs) can synthesize images of decent quality in a small number of steps. However, NATs usually require configuring a complicated generation policy comprising multiple manually designed scheduling rules. These heuristic-driven rules are prone to sub-optimality and demand expert knowledge and labor-intensive effort. Moreover, their one-size-fits-all nature cannot flexibly adapt to the diverse characteristics of each individual sample. To address these issues, we propose AdaNAT, a learnable approach that automatically configures a suitable policy tailored for every sample to be generated. Specifically, we formulate the determination of generation policies as a Markov decision process. Under this framework, a lightweight policy network for generation can be learned via reinforcement learning. Importantly, we demonstrate that simple reward designs, such as FID or pre-trained reward models, may not reliably guarantee the desired quality or diversity of generated samples. Therefore, we propose an adversarial reward design to guide the training of policy networks effectively. Comprehensive experiments on four benchmark datasets, i.e., ImageNet-256 & 512, MS-COCO, and CC3M, validate the effectiveness of AdaNAT. Code and pre-trained models will be released at https://github.com/LeapLabTHU/AdaNAT.
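
The formulation in the abstract can be written schematically as follows. This is a sketch in generic reinforcement-learning notation: the symbols $s_t$, $a_t$, $\pi_\theta$, $r_\phi$, $p_\theta$, and $x$ are assumptions introduced here for illustration and are not taken from the paper. At generation step $t$, the state $s_t$ summarizes the partially generated sample, the action $a_t$ is the generation configuration chosen for that step, and a reward is assigned to the final image $x$ after $T$ steps:

$$
\max_{\theta}\; \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t),\; t = 1, \dots, T}\big[\, r_\phi(x) \,\big]
$$

One plausible reading of an adversarial reward design is that the reward model $r_\phi$ is itself trained, in the manner of a GAN discriminator, to separate real images from generated ones while the policy is trained to maximize its output. A generic form of such a discriminator objective, assumed here rather than quoted from the paper, is

$$
\max_{\phi}\; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log r_\phi(x)\big] + \mathbb{E}_{x \sim p_\theta}\big[\log\big(1 - r_\phi(x)\big)\big]
$$

where $p_\theta$ denotes the distribution of images produced by the fixed NAT model under the policy $\pi_\theta$.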

Paper Structure (40 sections, 15 equations, 9 figures, 5 tables, 1 algorithm)

Figures (9)

  • Figure 2: Optimization process of AdaNAT. We plot the training curve of AdaNAT-L ($T=4$) on ImageNet 256$\times$256 and visualize samples from different stages. We train the policy network to output a suitable configuration, while keeping the pre-trained NAT model fixed and using it only for inference. FID-5K is used for efficient evaluation.
  • Figure 3: Visualizing the adaptive policy. The re-masking ratio $m^{(t)}$ (refine level), which controls the proportion of least-confident tokens to be refined at each step, is visualized as an example (see Section \ref{sec:prelim_parallel_decoding} for the definition of $m^{(t)}$). The policy network adaptively reduces $m^{(t)}$ so that only minor refinements are made once the sample has reached a decent quality; otherwise, it keeps a relatively high $m^{(t)}$ to allow more adjustments.
  • Figure 4: Ablation on different reward designs in AdaNAT. (a) AdaNAT-FID: directly using FID salimans2016improved as the reward. (b) AdaNAT-PRM: using a pre-trained reward model xu2024imagereward. (c) AdaNAT: our main approach with adversarial reward modeling.
  • Figure 5: Practical latency of AdaNAT on ImageNet 256$\times$256. GPU time is measured on an A100 GPU with batch size 50. CPU time is measured on a Xeon 8358 CPU with batch size 1. $\dagger$: diffusion models augmented with DPM-Solver lu2022dpm.
  • Figure 6: Qualitative comparisons between AdaNAT and AutoNAT ni2024revisiting on ImageNet 256$\times$256. AdaNAT generates images with superior visual quality.
  • ...and 4 more figures
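
The re-masking step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `remask_least_confident`, the `mask_id` sentinel, the plain-list token representation, and rounding the number of re-masked tokens down are all assumptions made here.

```python
import math


def remask_least_confident(tokens, confidences, ratio, mask_id=-1):
    """Return a copy of `tokens` in which the least-confident fraction
    `ratio` of positions is replaced by `mask_id` (hypothetical sentinel)."""
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")

    # Assumption: the number of re-masked tokens is rounded down.
    num_to_mask = math.floor(ratio * len(tokens))

    # Positions ordered from least to most confident (ties keep index order).
    order = sorted(range(len(tokens)), key=lambda i: confidences[i])
    to_mask = set(order[:num_to_mask])

    return [mask_id if i in to_mask else tok for i, tok in enumerate(tokens)]


# Toy data: 8 token ids with made-up confidence scores.
tokens = [11, 42, 7, 93, 5, 60, 18, 24]
confidences = [0.9, 0.2, 0.8, 0.1, 0.95, 0.4, 0.7, 0.3]

# A high ratio re-masks many tokens for further refinement.
print(remask_least_confident(tokens, confidences, ratio=0.5))
# → [11, -1, 7, -1, 5, -1, 18, -1]

# A low ratio touches only the single least-confident token.
print(remask_least_confident(tokens, confidences, ratio=0.125))
# → [11, 42, 7, -1, 5, 60, 18, 24]
```

In this reading, an adaptive policy would supply a different `ratio` at each step and for each sample, lowering it once the sample already looks good and keeping it higher when more adjustment is needed.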