Laplacian Multi-scale Flow Matching for Generative Modeling

Zelin Zhao, Petr Molodyk, Haotian Xue, Yongxin Chen

TL;DR

This paper presents Laplacian multiscale flow matching (LapFlow), a novel framework that enhances flow matching by leveraging multi-scale representations for image generative modeling and achieves superior sample quality with fewer GFLOPs and faster inference compared to single-scale and multi-scale flow matching baselines.

Abstract

In this paper, we present Laplacian multiscale flow matching (LapFlow), a novel framework that enhances flow matching by leveraging multi-scale representations for image generative modeling. Our approach decomposes images into Laplacian pyramid residuals and processes different scales in parallel through a mixture-of-transformers (MoT) architecture with causal attention mechanisms. Unlike previous cascaded approaches that require explicit renoising between scales, our model generates multi-scale representations in parallel, eliminating the need for bridging processes. The proposed multi-scale architecture not only improves generation quality but also accelerates the sampling process and promotes scaling flow matching methods. Through extensive experimentation on CelebA-HQ and ImageNet, we demonstrate that our method achieves superior sample quality with fewer GFLOPs and faster inference compared to single-scale and multi-scale flow matching baselines. The proposed model scales effectively to high-resolution generation (up to 1024$\times$1024) while maintaining lower computational overhead.
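
To make the decomposition step concrete, the following is a minimal NumPy sketch of a Laplacian pyramid and its exact inverse. It is not the authors' implementation: the function names, the 2x2 average-pooling downsampler, and the nearest-neighbour upsampler are all assumptions chosen for brevity, and the abstract does not say which filters LapFlow uses. The sketch only shows that an image splits into per-scale residuals plus a coarsest image, and that these sum back to the original.

```python
import numpy as np

def downsample(x):
    """Halve height and width by 2x2 average pooling (an assumed filter)."""
    h, w = x.shape[:2]
    return x.reshape(h // 2, 2, w // 2, 2, *x.shape[2:]).mean(axis=(1, 3))

def upsample(x):
    """Double height and width by nearest-neighbour repetition (an assumed filter)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(image, levels):
    """Return [residual_0, ..., residual_{levels-2}, coarsest], ordered fine to coarse.

    Height and width must be divisible by 2 ** (levels - 1).
    """
    pyramid = []
    current = image
    for _ in range(levels - 1):
        coarse = downsample(current)
        # The residual holds the detail lost when moving to the coarser scale.
        pyramid.append(current - upsample(coarse))
        current = coarse
    pyramid.append(current)
    return pyramid

def reconstruct(pyramid):
    """Invert laplacian_pyramid by upsampling and adding residuals, coarse to fine."""
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = upsample(current) + residual
    return current
```

For an 8x8 RGB image and three levels, this yields arrays of spatial size 8, 4, and 2, and `reconstruct` recovers the input up to floating-point error.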

Paper Structure

This paper contains 39 sections, 20 equations, 5 figures, 14 tables, 6 algorithms.

Figures (5)

  • Figure 1: Multi-scale generation process of our model. The proposed model follows a coarse-to-fine generation strategy across scales in a Laplacian pyramid. This figure demonstrates a three-level version of ours, where $T_2$, $T_1$ are two critical points defining three sampling segments for three scales. Starting from a random noise at $t=0$, our model first denoises the coarsest scale until $t=T_2$, then progressively conditions finer scales on completed coarser scales ($t=T_2$ to $T_1$ and $t=T_1$ to $1$). This causal structure ensures coherent image generation by maintaining hierarchical dependencies across scales, ultimately producing high-fidelity samples with both global consistency and fine details.
  • Figure 2: (Left) Schematic of the LapFlow model $V_\theta$. The multi-scale transformer takes multi-scale noisy states as input, conditioned on time and label, and predicts a velocity for each input scale. The model can take an arbitrary number of scales as input; two are shown here for simplicity. (Middle) Details of one multi-scale MoT block. Each scale has its own QKV projections, while attention is computed globally over the tokens of all scales, and a mask enforces causal relationships across scales. (Right) Details of the scale-specific PreAttnMod and PostAttnMod modules (cf. DiT), where each PostAttnMod module includes a feedforward network (FFN).
  • Figure 3: Qualitative results on CelebA-HQ 1024 (left two), 512 (middle two), and 256 (right).
  • Figure 4: Qualitative results on ImageNet $256\times256$ using our trained B/2 model with CFG=$1.5$.
  • Figure 5: FID-50K on ImageNet (256$\times$256) across training iterations, comparing LFM with ours using two backbones (B/2 and XL/2).
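
The cross-scale mask mentioned in the Figure 2 caption and the sampling segments in the Figure 1 caption can be sketched as below. This is one possible reading of the captions, not the paper's code. The function names are hypothetical, the rule that a token attends to its own scale and to all coarser scales is an assumption, and the critical-point values used in the example are placeholders, since the captions give no numbers.

```python
import numpy as np

def scale_causal_mask(tokens_per_scale):
    """Build a boolean attention mask over the concatenated tokens of all scales.

    Scales are listed coarse to fine. mask[i, j] is True when query token i may
    attend to key token j, i.e. when j belongs to the same or a coarser scale
    (an assumed rule).
    """
    scale_ids = np.repeat(np.arange(len(tokens_per_scale)), tokens_per_scale)
    return scale_ids[:, None] >= scale_ids[None, :]

def segment_index(t, critical_points):
    """Return which sampling segment time t falls into.

    critical_points is ascending, e.g. [T_2, T_1] for three levels. Segment 0
    covers [0, T_2), segment 1 covers [T_2, T_1), segment 2 covers [T_1, 1].
    """
    return int(np.searchsorted(critical_points, t, side="right"))

# Example: one coarse token and two fine tokens.
mask = scale_causal_mask([1, 2])
# Placeholder critical points, not values from the paper.
segment = segment_index(0.5, [0.3, 0.6])
```

In this sketch the coarse token cannot see the fine tokens, while each fine token sees every token. Such a mask could be passed to any attention implementation that accepts a boolean "allowed" matrix.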