FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching

Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen

TL;DR

FlowAR addresses the rigidity of prior scale-wise image generation by adopting a simple doubling-scale design and a flexible VAE tokenizer, eliminating the need for a specialized multi-scale discrete tokenizer. It pairs a scale-wise Transformer with a per-scale flow matching model, using Spatial-adaLN to condition generation on per-scale semantics. On ImageNet-256, FlowAR achieves state-of-the-art results, notably FID 1.65 for the largest model, outperforming VAR and diffusion-based baselines at comparable scales. The method's tokenizer- and scale-agnostic design enables easy integration with various VAEs and supports scalable, high-fidelity image synthesis in practical settings.

Abstract

Autoregressive (AR) modeling has achieved remarkable success in natural language processing by enabling models to generate text with coherence and contextual understanding through next token prediction. Recently, in image generation, VAR proposes scale-wise autoregressive modeling, which extends the next token prediction to the next scale prediction, preserving the 2D structure of images. However, VAR encounters two primary challenges: (1) its complex and rigid scale design limits generalization in next scale prediction, and (2) the generator's dependence on a discrete tokenizer with the same complex scale structure restricts modularity and flexibility in updating the tokenizer. To address these limitations, we introduce FlowAR, a general next scale prediction method featuring a streamlined scale design, where each subsequent scale is simply double the previous one. This eliminates the need for VAR's intricate multi-scale residual tokenizer and enables the use of any off-the-shelf Variational AutoEncoder (VAE). Our simplified design enhances generalization in next scale prediction and facilitates the integration of Flow Matching for high-quality image synthesis. We validate the effectiveness of FlowAR on the challenging ImageNet-256 benchmark, demonstrating superior generation performance compared to previous methods. Code will be available at https://github.com/OliverRensu/FlowAR.
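
To make the doubling-scale design concrete, the following is a minimal sketch of how a coarse-to-fine set of token maps could be derived from a single latent. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the latent is a single-channel square grid stored as nested lists, and average pooling stands in for the downsampling operator, which the abstract does not specify.

```python
def doubling_scales(finest: int, num_scales: int) -> list[int]:
    """Side lengths of the token maps, coarse to fine, each double the previous."""
    factor = 2 ** (num_scales - 1)
    if finest % factor != 0:
        raise ValueError("finest side must be divisible by 2 ** (num_scales - 1)")
    coarsest = finest // factor
    return [coarsest * 2 ** i for i in range(num_scales)]


def downsample(latent: list[list[float]], size: int) -> list[list[float]]:
    """Average-pool a square map down to size x size.

    Average pooling is an assumption made for this sketch; any
    downsampling operator could be substituted here.
    """
    block = len(latent) // size  # side of each pooling window
    return [
        [
            sum(
                latent[r * block + dr][c * block + dc]
                for dr in range(block)
                for dc in range(block)
            ) / block ** 2
            for c in range(size)
        ]
        for r in range(size)
    ]


def build_pyramid(latent: list[list[float]], num_scales: int) -> list[list[list[float]]]:
    """Token maps s^1 ... s^n obtained by downsampling the finest latent."""
    sizes = doubling_scales(len(latent), num_scales)
    return [downsample(latent, s) for s in sizes]
```

For a 16 x 16 latent and five scales, `doubling_scales(16, 5)` yields side lengths 1, 2, 4, 8, 16. Because every coarse map is computed from the finest latent alone, nothing in the construction depends on the tokenizer that produced it.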

Paper Structure

This paper contains 12 sections, 10 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Performance Comparison. The proposed FlowAR, a general next-scale prediction model enhanced with flow matching, consistently outperforms state-of-the-art VAR variants across different model sizes.
  • Figure 2: Comparison Between VAR and Our FlowAR in (a) Tokenizer and (b) Generator design. (a) VAR utilizes a complex multi-scale residual VQGAN discrete tokenizer, whereas FlowAR can leverage any off-the-shelf VAE continuous tokenizer, constructing coarse scale token maps by directly downsampling the finest scale token map. (b) VAR’s generator is constrained by the same complex and rigid scale design as its tokenizer, while FlowAR benefits from a simple and flexible scale design, enhanced by the flow matching model.
  • Figure 3: Overview of The Proposed FlowAR. FlowAR consists of three main components: (1) an off-the-shelf VAE that extracts a continuous latent representation of the image. We then create a set of coarse-to-fine scales by downsampling this latent, forming a sequence of token maps $\{s^1, s^2, \cdots, s^n\}$, where each subsequent scale doubles in size from the previous one. (2) A scale-wise autoregressive Transformer that takes as input the sequence $\{[C], \text{Up}(s^1, 2), \ldots, \text{Up}(s^{n-1}, 2)\}$, where $[C]$ is a condition token and $\text{Up}(\cdot, 2)$ denotes upsampling by a factor of 2. This Transformer generates semantic representations for different scales, $\{\hat{s}^1, \ldots, \hat{s}^{n}\}$. (3) A scale-wise flow matching model, conditioned on the semantics $\hat{s}^i$ at each scale $i$ (time step conditions are not shown for simplicity), predicts the velocity given a random time step $t$ that moves noise to the target data distribution.
  • Figure 4: Visualization of Samples Generated by FlowAR Using Different Tokenizers. FlowAR consistently produces high-quality visual samples across various tokenizer configurations, including the VAEs from MAR and SD.
  • Figure 5: Generated samples from FlowAR. FlowAR generates high-fidelity images of the great grey owl (class 24).
  • ...and 7 more figures
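
The Figure 3 caption describes two pieces that are easy to sketch: the Transformer input sequence $\{[C], \text{Up}(s^1, 2), \ldots, \text{Up}(s^{n-1}, 2)\}$ and the velocity target used by the per-scale flow matching model. The sketch below is hedged and illustrative only. The function names are hypothetical, nearest-neighbour interpolation is assumed for $\text{Up}(\cdot, 2)$, and the probability path is assumed to be the linear one, $x_t = (1 - t)\,\epsilon + t\,s^i$, whose velocity is $s^i - \epsilon$. The caption itself does not state either choice.

```python
import random

Map = list[list[float]]  # a single-channel square token map


def upsample2(token_map: Map) -> Map:
    """Upsample by a factor of 2 (nearest-neighbour is an assumption)."""
    out = []
    for row in token_map:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))  # repeat each row twice
    return out


def transformer_inputs(scales: list[Map], cond: str = "[C]") -> list:
    """Build {[C], Up(s^1, 2), ..., Up(s^{n-1}, 2)} from token maps s^1 ... s^n.

    Entry i has the same spatial size as the scale s^(i+1) it helps predict;
    the condition token stands in for the input at the coarsest scale.
    """
    return [cond] + [upsample2(s) for s in scales[:-1]]


def flow_matching_pair(target: Map, rng: random.Random):
    """Sample a time step, the noisy point x_t, and the velocity target.

    Assumes the linear path x_t = (1 - t) * noise + t * target, so the
    regression target is the constant velocity (target - noise).
    """
    t = rng.random()
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in target]
    x_t = [
        [(1 - t) * n + t * x for n, x in zip(n_row, x_row)]
        for n_row, x_row in zip(noise, target)
    ]
    velocity = [
        [x - n for n, x in zip(n_row, x_row)]
        for n_row, x_row in zip(noise, target)
    ]
    return t, x_t, velocity
```

Under the assumed linear path, moving from `x_t` along `velocity` for the remaining time `1 - t` lands exactly on the target map, which is what a network trained to regress this velocity is meant to reproduce at sampling time. In the full model the velocity network would additionally receive the semantics $\hat{s}^i$ and the time step as conditions; those are omitted here.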