Normalizing Flows are Capable Generative Models

Shuangfei Zhai, Ruixiang Zhang, Preetum Nakkiran, David Berthelot, Jiatao Gu, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Navdeep Jaitly, Josh Susskind

TL;DR

TarFlow is a Transformer-based autoregressive Normalizing Flow that scales image density modeling by stacking autoregressive Transformer blocks over image patches, alternating the autoregression direction between layers. It combines Gaussian noise augmentation during training, a post-training score-based denoising step, and guidance for both class-conditional and unconditional sampling to reach diffusion-like sample quality while retaining exact likelihoods. The approach sets a new state-of-the-art likelihood on ImageNet 64×64 and delivers competitive sample quality across multiple resolutions, indicating that normalizing flows can be competitive with modern generative models in both density estimation and generation. The work points to a simple, scalable route to high-fidelity image generation with NFs that are trained end to end directly on pixels.

Abstract

Normalizing Flows (NFs) are likelihood-based models for continuous inputs. They have demonstrated promising results on both density estimation and generative modeling tasks, but have received relatively little attention in recent years. In this work, we demonstrate that NFs are more powerful than previously believed. We present TarFlow: a simple and scalable architecture that enables highly performant NF models. TarFlow can be thought of as a Transformer-based variant of Masked Autoregressive Flows (MAFs): it consists of a stack of autoregressive Transformer blocks on image patches, alternating the autoregression direction between layers. TarFlow is straightforward to train end-to-end and capable of directly modeling and generating pixels. We also propose three key techniques to improve sample quality: Gaussian noise augmentation during training, a post-training denoising procedure, and an effective guidance method for both class-conditional and unconditional settings. Putting these together, TarFlow sets new state-of-the-art results on likelihood estimation for images, beating the previous best methods by a large margin, and generates samples with quality and diversity comparable to diffusion models, for the first time with a stand-alone NF model. We make our code available at https://github.com/apple/ml-tarflow.
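
The architecture described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `conditioner` function is a hypothetical stand-in for the causal Transformer, each "patch" is reduced to a single scalar, and the names `block_forward`, `block_inverse`, `flow_forward`, `flow_inverse`, and the `(a, b)` parameters are assumptions made for illustration. What it aims to show is the MAF-style structure: each block applies an affine map whose shift and log-scale depend only on earlier tokens, the direction alternates between blocks, the log-determinant is a simple sum, and inversion proceeds token by token.

```python
import math

def conditioner(prefix, a, b):
    """Toy stand-in for a causal Transformer (assumption): returns a shift
    and a log-scale that depend only on the tokens before position i."""
    if not prefix:
        return 0.0, 0.0
    m = sum(prefix) / len(prefix)
    return a * m, b * math.tanh(m)

def block_forward(x, a, b, reverse):
    """One flow block: optional reversal, then an autoregressive affine map."""
    seq = x[::-1] if reverse else list(x)
    z, logdet = [], 0.0
    for i, xi in enumerate(seq):
        mu, alpha = conditioner(seq[:i], a, b)
        z.append((xi - mu) * math.exp(-alpha))
        logdet -= alpha  # log|dz_i/dx_i| = -alpha_i
    return z, logdet

def block_inverse(z, a, b, reverse):
    """Invert one block sequentially: token i needs tokens < i to be known."""
    seq = []
    for zi in z:
        mu, alpha = conditioner(seq, a, b)
        seq.append(zi * math.exp(alpha) + mu)
    return seq[::-1] if reverse else seq

def flow_forward(x, params):
    """Stack blocks, alternating the autoregression direction per block."""
    total = 0.0
    for t, (a, b) in enumerate(params):
        x, logdet = block_forward(x, a, b, reverse=(t % 2 == 1))
        total += logdet
    return x, total

def flow_inverse(z, params):
    """Undo the blocks in reverse order to map latents back to data."""
    for t in reversed(range(len(params))):
        a, b = params[t]
        z = block_inverse(z, a, b, reverse=(t % 2 == 1))
    return z

def log_likelihood(x, params):
    """Exact log-density under a standard Gaussian prior on the latents."""
    z, logdet = flow_forward(x, params)
    log_prior = sum(-0.5 * zi * zi - 0.5 * math.log(2 * math.pi) for zi in z)
    return log_prior + logdet

x = [0.3, -1.2, 0.7, 2.0]
params = [(0.5, 0.1), (-0.3, 0.2), (0.8, -0.1)]  # hypothetical block parameters
z, logdet = flow_forward(x, params)
x_rec = flow_inverse(z, params)
```

In this sketch the forward pass (used for training and likelihood evaluation) can process all positions independently given the input, whereas the inverse pass (used for sampling) is inherently sequential within each block, which mirrors the usual trade-off of autoregressive flows.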

Paper Structure

This paper contains 25 sections, 11 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: TarFlow demonstrates substantial progress in the domain of normalizing flow models, achieving state-of-the-art results in both density estimation and sample generation. Left: We show the historical progression of likelihood performance on ImageNet 64x64, measured in bits per dimension (BPD), where our model significantly outperforms previous methods (see Table \ref{tab:likelihood} for details). Right: Selected samples from our model trained on ImageNet 128x128 demonstrate unprecedented image quality and diversity for a normalizing flow model, establishing a new benchmark for this class of generative models.
  • Figure 2: Left: TarFlow consists of $T$ flow blocks trained end to end. Right: a zoomed-in view of each flow block, which contains a sequence permutation operation, a standard causal Transformer, and an affine transformation applied to the permuted inputs.
  • Figure 3: Images of various resolutions generated by TarFlow models. From left to right, top to bottom: 256x256 images on AFHQ, 128x128 and 64x64 images on ImageNet.
  • Figure 4: Top: The effect of input noise $\sigma$ and denoising. All samples are generated with guidance weight $w = 2$ on ImageNet 128x128 from the same initial noise; better viewed when zoomed in. Bottom: Sample FID vs input noise $\sigma$ on ImageNet 64x64, with and without denoising. Before denoising, small $\sigma$ at first appears to give the best FID, because the raw samples contain less noise. However, after denoising with Equation \ref{eq:denoise}, slightly larger $\sigma$ yields better FID and more consistent shapes. Note that the scale of the right y-axis differs from that of the left.
  • Figure 5: Guidance weight $w$ vs FID for both the conditional and unconditional models (with $\tau=1.5$) on ImageNet 64x64. Note that the y-axis scales differ between the two settings.
  • ...and 7 more figures
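
The score-based denoising step mentioned in the TL;DR and in the Figure 4 caption can be sketched on a one-dimensional toy problem. The captions here do not spell out the denoising equation, so this sketch assumes it takes the standard Tweedie form $\hat{x} = y + \sigma^2 \nabla_y \log p(y)$, where $p$ is the density of noise-augmented data. The Gaussian toy density `log_p_noisy`, the finite-difference `score` (a stand-in for automatic differentiation through a trained model), and all parameter values are illustrative assumptions, not the authors' code.

```python
import math
import random

def log_p_noisy(y, data_std, sigma):
    """Toy log-density of noisy data: clean data ~ N(0, data_std^2) plus
    N(0, sigma^2) noise gives y ~ N(0, data_std^2 + sigma^2)."""
    var = data_std ** 2 + sigma ** 2
    return -0.5 * y * y / var - 0.5 * math.log(2 * math.pi * var)

def score(log_p, y, eps=1e-5):
    """Central finite-difference estimate of d/dy log p(y)."""
    return (log_p(y + eps) - log_p(y - eps)) / (2 * eps)

def tweedie_denoise(y, log_p, sigma):
    """Assumed denoising rule: move y along the score, scaled by sigma^2."""
    return y + sigma ** 2 * score(log_p, y)

data_std, sigma = 1.0, 0.5  # hypothetical values for the toy problem
log_p = lambda y: log_p_noisy(y, data_std, sigma)

rng = random.Random(0)
clean = [rng.gauss(0.0, data_std) for _ in range(20000)]
noisy = [x + rng.gauss(0.0, sigma) for x in clean]
denoised = [tweedie_denoise(y, log_p, sigma) for y in noisy]

# Mean squared error to the clean data, before and after denoising.
mse_noisy = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
mse_denoised = sum((d - x) ** 2 for x, d in zip(clean, denoised)) / len(clean)
```

For this Gaussian toy case the rule reduces to shrinking `y` by the factor `data_std**2 / (data_std**2 + sigma**2)`, which is the posterior mean of the clean value given the noisy one. With a trained flow, the same rule would use the gradient of the model's own log-likelihood in place of `log_p_noisy`.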