ARINAR: Bi-Level Autoregressive Feature-by-Feature Generative Models

Qinyu Zhao, Stephen Gould, Liang Zheng

TL;DR

ARINAR introduces a bi-level autoregressive framework that generates each image token feature-by-feature in a latent space. The outer AR produces a condition vector $\boldsymbol{z}$ for an inner AR, which, conditioned on $\boldsymbol{z}$, autoregressively models each of the token's $16$ features using a Gaussian Mixture Model, enabling efficient sampling. On ImageNet $256\times256$, a $213$M-parameter ARINAR-B achieves $\text{FID}=2.75$ with CFG, which is comparable to the state-of-the-art MAR-B ($\text{FID}=2.31$), while being roughly $5\times$ faster and without relying on diffusion. The results suggest that latent-space, feature-level autoregression can approach the quality of diffusion-based methods at a much lower sampling cost, with CFG and temperature offering additional quality control. This points to the potential of latent bi-level autoregressive designs for scalable, fast image generation.
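
The bi-level design described above can be written as a nested factorization. The following is a sketch in assumed notation, not taken from the paper: $\boldsymbol{x}_i$ is the $i$-th token, $x_{i,j}$ its $j$-th feature, $\boldsymbol{z}_i$ the condition vector the outer AR computes from $\boldsymbol{x}_{<i}$, and $K$ a hypothetical number of mixture components whose parameters $\pi_k, \mu_k, \sigma_k$ are predicted by the inner AR from $(\boldsymbol{z}_i, x_{i,<j})$.

$$
p(\boldsymbol{x}_1,\dots,\boldsymbol{x}_n) = \prod_{i=1}^{n} p(\boldsymbol{x}_i \mid \boldsymbol{x}_{<i}),
\qquad
p(\boldsymbol{x}_i \mid \boldsymbol{x}_{<i}) \approx \prod_{j=1}^{16} p(x_{i,j} \mid x_{i,<j}, \boldsymbol{z}_i),
\qquad
p(x_{i,j} \mid x_{i,<j}, \boldsymbol{z}_i) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(x_{i,j};\, \mu_k, \sigma_k^2\right)
$$

Because each innermost factor is a one-dimensional mixture, sampling a feature only requires picking a component and drawing from a scalar Gaussian.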

Abstract

Existing autoregressive (AR) image generative models use a token-by-token generation schema. That is, they predict a per-token probability distribution and sample the next token from that distribution. The main challenge is how to model the complex distribution of high-dimensional tokens. Previous methods are either too simplistic to fit the distribution or result in slow generation. Instead of fitting the distribution of whole tokens, we explore using an AR model to generate each token in a feature-by-feature way, i.e., taking the generated features as input and generating the next feature. Based on that, we propose ARINAR (AR-in-AR), a bi-level AR model. The outer AR layer takes previous tokens as input and predicts a condition vector z for the next token. The inner layer, conditioned on z, generates the features of the next token autoregressively. In this way, the inner layer only needs to model the distribution of a single feature, for example, using a simple Gaussian Mixture Model. On the ImageNet 256x256 image generation task, ARINAR-B with 213M parameters achieves an FID of 2.75, which is comparable to the state-of-the-art MAR-B model (FID=2.31), while being five times faster than the latter.
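
The sampling loop described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the outer and inner networks are replaced by random placeholder weights, and every name here (`outer_step`, `inner_gmm_params`, `sample_token`, `NUM_COMPONENTS`, `COND_DIM`) is hypothetical. Only the count of 16 features per token comes from the text above. Applying temperature as a scale on the component standard deviation is an assumption, and CFG is omitted.

```python
import numpy as np

NUM_FEATURES = 16    # features per token, as stated in the TL;DR
NUM_COMPONENTS = 4   # mixture components per feature (assumed)
COND_DIM = 8         # width of the condition vector z (assumed)

# Random placeholder weights standing in for trained networks.
_init = np.random.default_rng(0)
W_outer = _init.normal(size=(NUM_FEATURES, COND_DIM)) * 0.1
W_inner = _init.normal(size=(COND_DIM + NUM_FEATURES, 3 * NUM_COMPONENTS)) * 0.1


def outer_step(prev_tokens):
    """Toy outer AR: summarize previous tokens into a condition vector z."""
    if len(prev_tokens) == 0:
        context = np.zeros(NUM_FEATURES)
    else:
        context = np.mean(prev_tokens, axis=0)
    return np.tanh(context @ W_outer)


def inner_gmm_params(z, features_so_far):
    """Toy inner AR: GMM parameters for the next feature, given z and earlier features."""
    padded = np.zeros(NUM_FEATURES)
    padded[: len(features_so_far)] = features_so_far
    h = np.concatenate([z, padded]) @ W_inner
    logits, means, log_stds = np.split(h, 3)
    weights = np.exp(logits - logits.max())  # softmax over components
    weights /= weights.sum()
    return weights, means, np.exp(log_stds)


def sample_token(prev_tokens, rng, temperature=1.0):
    """Sample one token feature-by-feature, conditioned on z from the outer step."""
    z = outer_step(prev_tokens)
    features = []
    for _ in range(NUM_FEATURES):
        weights, means, stds = inner_gmm_params(z, features)
        k = rng.choice(NUM_COMPONENTS, p=weights)  # pick a mixture component
        # Assumption: temperature scales the standard deviation of the chosen component.
        features.append(rng.normal(means[k], temperature * stds[k]))
    return np.array(features)


def sample_image_tokens(num_tokens, seed=0):
    """Outer loop: generate tokens one at a time, each from the inner loop."""
    rng = np.random.default_rng(seed)
    tokens = []
    for _ in range(num_tokens):
        tokens.append(sample_token(tokens, rng))
    return np.stack(tokens)
```

Calling `sample_image_tokens(4)` returns an array of shape `(4, 16)`: four tokens, each built from sixteen scalar draws. The point of the sketch is the cost structure: each feature needs only a categorical draw and a scalar Gaussian draw, rather than an iterative denoising procedure per token.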

Paper Structure

This paper contains 12 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Design of ARINAR. The model consists of two AR layers. The outer layer predicts a condition vector $\boldsymbol{z}$, while the inner layer autoregressively generates the features of the next token based on $\boldsymbol{z}$. The blue dashed arrows indicate the gradient flow. GMM denotes Gaussian Mixture Model.
  • Figure 2: Qualitative Results. We show selected examples of class-conditional generation on ImageNet 256$\times$256 using ARINAR-B.
  • Figure 3: Generated samples from ARINAR.
  • Figure 4: Failure cases. Some cases are very challenging for ARINAR, such as human bodies, multiple objects, text, and complex structures. The generated images contain noticeable artifacts.