Jet: A Modern Transformer-Based Normalizing Flow

Alexander Kolesnikov, André Susano Pinto, Michael Tschannen

TL;DR

Jet rethinks normalizing flows by using Vision Transformer (ViT) blocks inside affine coupling layers, yielding a simple and effective model that needs no multiscale architecture or extra normalization components. Trained by exact log-likelihood maximization on dequantized inputs, Jet achieves state-of-the-art performance among coupling-based flows on ImageNet variants and benefits substantially from ImageNet-21k pretraining, which transfers successfully to ImageNet-1k and CIFAR-10. The work shows that ViT-based coupling blocks can outperform CNN-based variants while keeping the architecture compact and suitable as a building block for larger generative systems such as JetFormer. Overall, Jet offers a transferable component for modern flow-based generative modeling and highlights the continued relevance of normalizing flows in combination with transformer architectures.

Abstract

In the past, normalizing generative flows emerged as a promising class of generative models for natural images. This type of model has many modeling advantages: the ability to efficiently compute the log-likelihood of the input data, fast generation, and a simple overall structure. Normalizing flows remained a topic of active research but later fell out of favor, as the visual quality of the samples was not competitive with other model classes, such as GANs, VQ-VAE-based approaches, or diffusion models. In this paper we revisit the design of coupling-based normalizing flow models by carefully ablating prior design choices and using computational blocks based on the Vision Transformer architecture rather than convolutional neural networks. As a result, we achieve state-of-the-art quantitative and qualitative performance with a much simpler architecture. While the overall visual quality is still behind the current state-of-the-art models, we argue that strong normalizing flow models can help advance the research frontier by serving as building components of more powerful generative models.

Paper Structure

This paper contains 22 sections, 4 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Overview of the Jet model. The dashed box contains a coupling layer, which computes an affine transform from one half of the input dimensions (patches or features) and then applies it to the other half of the input dimensions. The full model is obtained by stacking $N$ such invertible coupling layers.
  • Figure 2: Effect of different architecture design choices on the validation NLL (in bits per dimension), as a function of training compute. (a) Results on ImageNet-1k $64\times64$ for CNN vs ViT blocks (the marker size is proportional to the model parameter count). ViT blocks clearly outperform CNN blocks for a given training compute budget. (b) Results on ImageNet-21k $32\times32$ for different ViT depths. Increasing the block depth leads to improved results up to depth 5.
  • Figure 3: NLL as a function of training compute obtained when training Jet architectures with a range of architecture hyperparameters, for 4 different data sets. The size of each marker is proportional to the number of parameters in the model configuration. Overall, we observe that normalizing flow models benefit from scale, yet ImageNet-1k models start to overfit. When increasing the amount of data to ImageNet-21k size, we observe little overfitting and strong scaling trends.
  • Figure 4: Ablation of coupling types. Negative log-likelihood on ImageNet-1k $64\times64$ when varying the ratio of channel-wise to spatial-wise couplings and when using different types of spatial-wise couplings. Results in table format are given in Appendix Table \ref{tab:coupling-kinds}.
  • Figure 5: Random samples for ImageNet-1k at both $32\times32$ and $64\times64$ resolution. We show samples from Jet when trained from scratch and when finetuning a model pretrained on ImageNet-21k. For comparison we also show samples from Flow++ (Ho et al., 2019).
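
The coupling mechanism described in the Figure 1 caption, together with the bits-per-dimension metric used in Figures 2 to 4, can be illustrated with a minimal sketch. This is not the authors' implementation. The function `toy_net` is a hypothetical stand-in for the ViT block, the list reversal between layers is an assumed (and simplistic) way to alternate which half is transformed, and uniform dequantization noise with a standard Gaussian prior is assumed because the text above only states that the input is dequantized. All names are illustrative.

```python
import math
import random

def toy_net(x_half, weights):
    """Hypothetical stand-in for the ViT block: maps one half to (log_scale, shift)."""
    w_scale, w_shift = weights
    h = sum(x_half) / len(x_half)                      # crude summary of the conditioning half
    log_scale = [math.tanh(w * h) for w in w_scale]    # bounded log-scale (an assumption)
    shift = [w * h for w in w_shift]
    return log_scale, shift

def coupling_forward(x, weights):
    """Affine coupling: the first half passes through and parameterizes the
    affine transform applied to the second half. Requires even len(x)."""
    d = len(x) // 2
    x1, x2 = x[:d], x[d:]
    log_scale, shift = toy_net(x1, weights)
    y2 = [v * math.exp(s) + b for v, s, b in zip(x2, log_scale, shift)]
    return x1 + y2, sum(log_scale)                     # log|det J| is the sum of log-scales

def coupling_inverse(y, weights):
    """Exact inverse: the untouched half reproduces the same scale and shift."""
    d = len(y) // 2
    y1, y2 = y[:d], y[d:]
    log_scale, shift = toy_net(y1, weights)
    x2 = [(v - b) * math.exp(-s) for v, s, b in zip(y2, log_scale, shift)]
    return y1 + x2

def flow_forward(x, layers):
    """Stack of coupling layers; reversing swaps the roles of the two halves."""
    total_logdet = 0.0
    for weights in layers:
        x, logdet = coupling_forward(x, weights)
        total_logdet += logdet
        x = x[::-1]                                    # permutation, log-det contribution is 0
    return x, total_logdet

def flow_inverse(z, layers):
    for weights in reversed(layers):
        z = coupling_inverse(z[::-1], weights)
    return z

def make_layers(num_layers, half_dim, rng):
    """Random toy parameters; a real model would learn these by maximizing likelihood."""
    return [([rng.gauss(0, 1) for _ in range(half_dim)],
             [rng.gauss(0, 1) for _ in range(half_dim)]) for _ in range(num_layers)]

def nll_bits_per_dim(pixels, layers, rng):
    """Exact NLL in bits per dimension for 8-bit pixels, assuming uniform dequantization."""
    d = len(pixels)
    x = [(p + rng.random()) / 256.0 for p in pixels]   # dequantize, then scale to [0, 1)
    z, logdet = flow_forward(x, layers)
    log_pz = -0.5 * sum(v * v for v in z) - 0.5 * d * math.log(2 * math.pi)
    nll_nats = -(log_pz + logdet)                      # change-of-variables formula
    return nll_nats / (d * math.log(2)) + 8.0          # +8 bits undoes the 1/256 scaling

if __name__ == "__main__":
    rng = random.Random(0)
    layers = make_layers(num_layers=4, half_dim=4, rng=rng)
    pixels = [rng.randrange(256) for _ in range(8)]
    print("bits per dimension:", nll_bits_per_dim(pixels, layers, rng))
```

Because each layer leaves one half untouched, the Jacobian is triangular and its log-determinant is just the sum of the predicted log-scales, which is what makes the exact log-likelihood cheap to evaluate. Sampling runs the same layers in reverse via `flow_inverse` on a draw from the prior.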