Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment

Yang Chen, Xiaowei Xu, Shuai Wang, Chenhui Zhu, Ruxue Wen, Xubin Li, Tiezheng Ge, Limin Wang

TL;DR

The paper tackles the gap between semantic representation and generative quality in normalizing flows by introducing reverse representation alignment (R-REPA), which exploits NF invertibility to align intermediate features along the generative path with a pretrained vision encoder. It also proposes a training-free, test-time classification method to probe the NF’s semantic knowledge, and extends the approach to latent-space generation via a VAE backbone for high-resolution synthesis. Through extensive ablations and experiments on ImageNet at 64×64 and 256×256, R-REPA yields state-of-the-art NF performance, accelerates training by over 3.3×, and achieves superior FID, sFID, and classification accuracy compared to strong baselines. The method demonstrates robustness across encoders and scales to high resolutions with efficient two-step sampling, establishing a principled, invertibility-aware route to higher-fidelity flow-based generation. Code is released for reproducibility and further exploration.

Abstract

Normalizing Flows (NFs) are a class of generative models distinguished by a mathematically invertible architecture, where the forward pass transforms data into a latent space for density estimation, and the reverse pass generates new samples from this space. This characteristic creates an intrinsic synergy between representation learning and data generation. However, the generative quality of standard NFs is limited by poor semantic representations from log-likelihood optimization. To remedy this, we propose a novel alignment strategy that creatively leverages the invertibility of NFs: instead of regularizing the forward pass, we align the intermediate features of the generative (reverse) pass with representations from a powerful vision foundation model, demonstrating superior effectiveness over naive alignment. We also introduce a novel training-free, test-time optimization algorithm for classification, which provides a more intrinsic evaluation of the NF's embedded semantic knowledge. Comprehensive experiments demonstrate that our approach accelerates the training of NFs by over 3.3$\times$, while simultaneously delivering significant improvements in both generative quality and classification accuracy. New state-of-the-art results for NFs are established on ImageNet 64$\times$64 and 256$\times$256. Our code is available at https://github.com/MCG-NJU/FlowBack.
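
The forward/reverse structure described in the abstract can be illustrated with a minimal sketch. It assumes a toy element-wise affine flow with a standard normal prior, not the TARFlow architecture used in the paper; the names `forward`, `reverse`, `log_likelihood`, `log_scale`, and `shift` are hypothetical and chosen only for illustration.

```python
import math


def forward(x, log_scale, shift):
    """Forward pass: map data x to latent z and return log|det J| of the map."""
    z = [(xi - b) * math.exp(-s) for xi, s, b in zip(x, log_scale, shift)]
    # dz_i/dx_i = exp(-s_i), so the log-determinant is the sum of -s_i.
    log_det = -sum(log_scale)
    return z, log_det


def reverse(z, log_scale, shift):
    """Reverse (generative) pass: exact inverse of `forward`."""
    return [zi * math.exp(s) + b for zi, s, b in zip(z, log_scale, shift)]


def log_likelihood(x, log_scale, shift):
    """Change of variables: log p(x) = log N(z; 0, I) + log|det J|."""
    z, log_det = forward(x, log_scale, shift)
    log_pz = sum(-0.5 * zi * zi - 0.5 * math.log(2.0 * math.pi) for zi in z)
    return log_pz + log_det


if __name__ == "__main__":
    x = [0.5, -1.2]
    log_scale, shift = [0.3, -0.1], [0.1, 0.2]
    z, _ = forward(x, log_scale, shift)
    print("latent:", z)
    print("reconstruction:", reverse(z, log_scale, shift))
    print("log-likelihood:", log_likelihood(x, log_scale, shift))
```

Because `reverse` exactly inverts `forward`, every intermediate feature has a well-defined counterpart on the generative path, which is the property the reverse alignment strategy relies on.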

Paper Structure

This paper contains 51 sections, 9 equations, 7 figures, 9 tables, 2 algorithms.

Figures (7)

  • Figure 1: TARFlow as a representative NF. (a) The training process maps images to a noise distribution. (b) The reverse pass generates images. (c) Classification by optimizing a label token under the NF loss. (d) The FID-Accuracy plot demonstrates that our representation alignment improves both generation quality and classification performance.
  • Figure 2: Selected samples on ImageNet 256$\times$256 from L-TARFlow + R-REPA. We use a classifier-free guidance scale of 2.0.
  • Figure 3: An overview of our Representation Alignment (REPA) mechanism. Left: Intermediate features from a TARFlow block are projected by an MLP and aligned with features from a pre-trained visual encoder. Right: The three gradient backpropagation strategies explored: (a) Forward REPA (F-REPA), updating all preceding blocks; (b) Detach REPA (D-REPA), updating only the current block; and (c) Reverse REPA (R-REPA), which leverages the inverse (generative) computational graph to update all subsequent blocks. While we depict alignment at a single location for clarity, this mechanism can be applied concurrently across multiple layers.
  • Figure 4: Validation of our proposed classification metric against the standard linear probing protocol. The plot shows classification accuracy of our metric and standard linear probing.
  • Figure 5: Hyperparameter Ablations and Training Convergence. Left: CFG search results on ImageNet 64$\times$64. The Reverse REPA strategy applied to later model blocks yields the best performance across various CFG scales. Center: Ablation of the noise standard deviation in latent space. We identify an optimal noise standard deviation of 0.20. Right: Reverse REPA improves sample fidelity and accelerates training convergence by 3.3$\times$ on ImageNet 64$\times$64.
  • ...and 2 more figures
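
The alignment step described in the Figure 3 caption (project intermediate features, then compare them with features from a pretrained encoder) can be sketched as follows. This is a sketch under stated assumptions: the caption does not give the loss, so a mean negative cosine similarity over patches is assumed here, and a single linear layer stands in for the MLP projector. The names `project`, `cosine_similarity`, and `alignment_loss` are hypothetical.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def project(h, weight, bias):
    """One linear layer standing in for the MLP projector (assumption)."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weight, bias)]


def alignment_loss(flow_feats, encoder_feats, weight, bias):
    """Assumed loss: mean negative cosine similarity across patches."""
    sims = [
        cosine_similarity(project(h, weight, bias), y)
        for h, y in zip(flow_feats, encoder_feats)
    ]
    return -sum(sims) / len(sims)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zero_bias = [0.0, 0.0]
    flow_feats = [[1.0, 2.0], [0.5, -0.5]]      # one vector per patch
    encoder_feats = [[1.0, 2.0], [0.5, -0.5]]   # matching encoder features
    print(alignment_loss(flow_feats, encoder_feats, identity, zero_bias))
```

The sketch only computes the loss value. Which blocks receive its gradient (all preceding blocks for F-REPA, only the current block for D-REPA, all subsequent blocks for R-REPA) depends on the computational graph the intermediate features are taken from, which is not modeled here.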