SimFlow: Simplified and End-to-End Training of Latent Normalizing Flows

Qinyu Zhao, Guangting Zheng, Tao Yang, Rui Zhu, Xingjian Leng, Stephen Gould, Liang Zheng

TL;DR

The paper tackles bottlenecks in latent normalizing flows caused by data-augmentation noise and frozen VAE encoders by fixing the VAE encoder variance to a constant, enabling end-to-end joint training of VAEs and NFs. SimFlow demonstrates strong generation quality on ImageNet 256×256, achieving a gFID of 2.15 and further improvements to 1.91 when combined with REPA-E, indicating state-of-the-art performance among NF-based models. The approach yields a smoother, more generation-friendly latent space, supported by analyses of latent-space statistics, and is complemented by enhancements like REPA-E alignment and a revised classifier-free guidance strategy. Overall, SimFlow simplifies training pipelines, accelerates convergence, and offers a practical path toward high-fidelity, end-to-end trained latent NF-based generation on large-scale datasets.

Abstract

Normalizing Flows (NFs) learn invertible mappings between the data and a Gaussian distribution. Prior works usually suffer from two limitations. First, they add random noise to training samples or VAE latents as data augmentation, introducing complex pipelines including extra noising and denoising steps. Second, they use a pretrained and frozen VAE encoder, resulting in suboptimal reconstruction and generation quality. In this paper, we find that the two issues can be solved in a very simple way: just fixing the variance (which would otherwise be predicted by the VAE encoder) to a constant (e.g., 0.5). On the one hand, this method allows the encoder to output a broader distribution of tokens and the decoder to learn to reconstruct clean images from the augmented token distribution, avoiding additional noise or denoising design. On the other hand, fixed variance simplifies the VAE evidence lower bound, making it stable to train an NF with a VAE jointly. On the ImageNet $256 \times 256$ generation task, our model SimFlow obtains a gFID score of 2.15, outperforming the state-of-the-art method STARFlow (gFID 2.40). Moreover, SimFlow can be seamlessly integrated with the end-to-end representation alignment (REPA-E) method and achieves an improved gFID of 1.91, setting a new state of the art among NFs.
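The fixed-variance idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the tensor shapes, and the choice to treat 0.5 as a standard deviation (following the $\bar{\sigma}^2=0.5^2$ notation in the figure captions) are assumptions made here for illustration. The sketch shows the reparameterized sample $z = \mu + \bar{\sigma}\,\epsilon$ and the fact that the entropy of the resulting Gaussian does not depend on the encoder output, which is one way a fixed variance can simplify the evidence lower bound.

```python
import numpy as np

def sample_latent_fixed_var(mu, sigma_bar=0.5, rng=None):
    """Reparameterized sample z = mu + sigma_bar * eps with a constant std.

    `sigma_bar` is a hyperparameter here, not an encoder output (assumption:
    0.5 is used as the standard deviation).
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(mu.shape)
    return mu + sigma_bar * eps

def gaussian_entropy(sigma_bar, dim):
    """Entropy of N(mu, sigma_bar^2 I) in `dim` dimensions.

    The value depends only on sigma_bar and dim, not on mu, so with a fixed
    variance this term is a constant with respect to the encoder parameters.
    """
    return 0.5 * dim * np.log(2.0 * np.pi * np.e * sigma_bar ** 2)

rng = np.random.default_rng(0)
mu = rng.standard_normal((4, 16))  # hypothetical encoder means: 4 tokens, 16 channels
z = sample_latent_fixed_var(mu, sigma_bar=0.5, rng=rng)
print(z.shape, gaussian_entropy(0.5, dim=16))
```

In a standard VAE the encoder would predict a per-token variance alongside `mu`; here only `mu` is learned, so every token is perturbed by noise of the same scale during training.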

Paper Structure

This paper contains 24 sections, 19 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Comparison of our framework with closely-related methods. (a) Standard practice trains a VAE first and then trains a generative model with the VAE frozen. Note that, for each image, the VAE encoder outputs the mean $\boldsymbol{\mu}$ and the variance $\boldsymbol{\sigma}^2$ of a Gaussian, from which a set of tokens is sampled. The variance $\boldsymbol{\sigma}^2$ is usually very small in a standard pretrained VAE. (b) REPA-E jointly trains diffusion with VAE using the REPA loss, and the diffusion gradient is stopped before the VAE to avoid latent collapse (where the token variation decreases and the generation quality degrades). (c) STARFlow trains NF and decoder on noisy latents with the encoder frozen. (d) We train an NF and a VAE in an end-to-end way from scratch. There is no stop-gradient operator, significantly simplifying prior frameworks. Solid arrows indicate the forward pass, while dashed arrows denote gradient flows. We label frozen modules in gray, generative models in green, and VAE modules involved in training in red.
  • Figure 2: Comparing SimFlow with prior works. On ImageNet $256\times256$, our end-to-end trained model SimFlow achieves significantly better generation quality than the state-of-the-art NF model STARFlow with far fewer training epochs. Training SimFlow with REPA-E further improves gFID.
  • Figure 3: Robustness of VAEs with fixed variances. (a) A VAE with a large and fixed variance can maintain reconstruction quality under latent noise, while the performance of a VAE with learnable variance degrades significantly. (b) For VAEs with a large variance, the images reconstructed from linearly interpolated latents still clearly show the main subjects (the cat or the dog), rather than blending them. 'Learnable' indicates a standard VAE with learnable variance, while '$\bar{\sigma}^2=x^2$' denotes a VAE with a fixed variance of $x^2$.
  • Figure 4: End-to-end training makes latent space more suitable for developing generative models. (a) Spectral entropy measures the randomness of frequency components; lower values indicate simpler data distributions in the frequency domain. (b) Ratio of high-frequency components. (c) Total variation captures the overall local changes across tokens; lower values imply smoother latents. (d) Autocorrelation reflects how similar a token sequence is to a shifted version of itself; higher autocorrelation indicates stronger spatial consistency.
  • Figure 5: Variant studies. 'Frozen VAE' means both the VAE encoder and decoder are frozen during training. 'Frozen enc' means the encoder is frozen while the decoder is trained. 'End-to-end' means the VAE encoder, the VAE decoder, and the NF are jointly trained from scratch. 'Learnable var' means the variance is predicted by the VAE, while 'Fixed var' is our method with $\bar{\sigma}^2=0.5^2$. 'LN' denotes applying layer normalization to the VAE encoder. 'Noise augmented' indicates adding Gaussian noise to VAE latents as done by STARFlow.
  • ...and 8 more figures
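The four latent-space statistics named in the Figure 4 caption can be sketched in NumPy as follows. These are common textbook definitions applied to a 1-D token sequence; the paper's exact formulas, cutoff frequency, and handling of 2-D token grids are not given in this section, so the function names, the `cutoff` parameter, and the synthetic test signals are all assumptions for illustration.

```python
import numpy as np

def spectral_entropy(x):
    """Shannon entropy of the normalized power spectrum of a 1-D signal."""
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2
    p = power / (power.sum() + 1e-12)
    p = p[p > 0]  # drop empty bins so log is defined
    return float(-(p * np.log(p)).sum())

def high_freq_ratio(x, cutoff=0.5):
    """Fraction of spectral power above `cutoff` times the Nyquist frequency."""
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2
    k = int(cutoff * (len(power) - 1))
    return float(power[k + 1:].sum() / (power.sum() + 1e-12))

def total_variation(x):
    """Mean absolute difference between neighboring entries."""
    return float(np.abs(np.diff(x)).mean())

def autocorrelation(x, lag=1):
    """Pearson correlation between the signal and a copy shifted by `lag`."""
    return float(np.corrcoef(x[:-lag], x[lag:])[0, 1])

# Synthetic stand-ins (not real latents): a smooth sequence and white noise.
t = np.linspace(0.0, 1.0, 256, endpoint=False)
smooth = np.sin(2 * np.pi * 2 * t)
noisy = np.random.default_rng(0).standard_normal(256)

for name, seq in [("smooth", smooth), ("noisy", noisy)]:
    print(name, spectral_entropy(seq), high_freq_ratio(seq),
          total_variation(seq), autocorrelation(seq))
```

Under these definitions, a smoother sequence gives lower spectral entropy, a lower high-frequency ratio, lower total variation, and higher autocorrelation, which matches the direction of each reading described in the caption.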