Jet: A Modern Transformer-Based Normalizing Flow
Alexander Kolesnikov, André Susano Pinto, Michael Tschannen
TL;DR
Jet rethinks normalizing flows by using Vision Transformer (ViT) blocks inside affine coupling layers, yielding a simple and effective model that needs neither a multiscale architecture nor extra normalization components. Trained by maximizing the exact log-likelihood of dequantized inputs, Jet achieves state-of-the-art performance among coupling-based flows on ImageNet variants and benefits substantially from ImageNet-21k pretraining, which transfers successfully to ImageNet-1k and CIFAR-10. The work shows that ViT-based coupling blocks can surpass CNN-based variants while keeping the architecture compact and easy to integrate as a building block in larger generative systems such as JetFormer. Overall, Jet offers a transferable component for modern flow-based generative modeling and highlights the continued relevance of normalizing flows in combination with transformer architectures.
Abstract
In the past, normalizing flows emerged as a promising class of generative models for natural images. This type of model has many modeling advantages: the ability to efficiently compute the log-likelihood of the input data, fast generation, and a simple overall structure. Normalizing flows remained a topic of active research but later fell out of favor, as the visual quality of their samples was not competitive with other model classes, such as GANs, VQ-VAE-based approaches, or diffusion models. In this paper we revisit the design of coupling-based normalizing flow models by carefully ablating prior design choices and using computational blocks based on the Vision Transformer architecture rather than convolutional neural networks. As a result, we achieve state-of-the-art quantitative and qualitative performance with a much simpler architecture. While the overall visual quality is still behind the current state-of-the-art models, we argue that strong normalizing flow models can help advance the research frontier by serving as building components of more powerful generative models.
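
To make the coupling-based construction concrete, the following is a minimal NumPy sketch of a single affine coupling layer and the exact log-likelihood it yields under a standard normal prior. It is an illustration under stated assumptions, not the paper's implementation: `toy_conditioner` is a hypothetical linear stand-in for the ViT block, the even split of the feature dimension is an assumption, the `tanh` bound on the log-scale is an assumption, and uniform dequantization noise is assumed as the dequantization scheme. All function and parameter names are invented for this example.

```python
import numpy as np

def toy_conditioner(x_a, w_scale, w_shift):
    # Hypothetical stand-in for the ViT block: any function of x_a is valid,
    # because invertibility never requires inverting the conditioner itself.
    log_scale = np.tanh(x_a @ w_scale)  # assumed bound, keeps exp() stable
    shift = x_a @ w_shift
    return log_scale, shift

def coupling_forward(x, w_scale, w_shift):
    # Split features; x_a passes through unchanged, x_b is affinely transformed.
    d = x.shape[-1] // 2
    x_a, x_b = x[..., :d], x[..., d:]
    log_scale, shift = toy_conditioner(x_a, w_scale, w_shift)
    y_b = x_b * np.exp(log_scale) + shift
    # The Jacobian is triangular, so its log-determinant is a simple sum.
    log_det = log_scale.sum(axis=-1)
    return np.concatenate([x_a, y_b], axis=-1), log_det

def coupling_inverse(y, w_scale, w_shift):
    # y_a equals x_a, so the same scale and shift can be recomputed exactly.
    d = y.shape[-1] // 2
    y_a, y_b = y[..., :d], y[..., d:]
    log_scale, shift = toy_conditioner(y_a, w_scale, w_shift)
    x_b = (y_b - shift) * np.exp(-log_scale)
    return np.concatenate([y_a, x_b], axis=-1)

def log_likelihood(x, w_scale, w_shift):
    # Change of variables: log p(x) = log N(z; 0, I) + log|det dz/dx|.
    z, log_det = coupling_forward(x, w_scale, w_shift)
    log_pz = -0.5 * (z ** 2).sum(axis=-1) - 0.5 * z.shape[-1] * np.log(2 * np.pi)
    return log_pz + log_det

rng = np.random.default_rng(0)
dim = 8
w_scale = rng.normal(scale=0.1, size=(dim // 2, dim // 2))
w_shift = rng.normal(scale=0.1, size=(dim // 2, dim // 2))

# Assumed uniform dequantization: add noise in [0, 1) to 8-bit values, rescale.
pixels = rng.integers(0, 256, size=(4, dim))
x = (pixels + rng.uniform(size=pixels.shape)) / 256.0

y, log_det = coupling_forward(x, w_scale, w_shift)
x_rec = coupling_inverse(y, w_scale, w_shift)
ll = log_likelihood(x, w_scale, w_shift)  # one exact value per example
```

Because the untouched half is what feeds the conditioner, the layer can be inverted in closed form no matter how expressive the conditioner is, which is what allows a transformer block to be dropped in at that position. A full flow would stack many such layers, changing which half is transformed, and sum their log-determinants.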
