How to build a consistency model: Learning flow maps via self-distillation

Nicholas M. Boffi, Michael S. Albergo, Eric Vanden-Eijnden

TL;DR

This work tackles the computational bottleneck of sampling with flow-based models by training flow maps directly through self-distillation. It presents three algorithmic families: Lagrangian self-distillation (LSD), Eulerian self-distillation (ESD), and Progressive self-distillation (PSD). Together they unify existing distillation and direct-training schemes under a single framework, with theoretical guarantees established for LSD and ESD. Empirically, LSD delivers the most stable training and the best sample quality across synthetic and real-world datasets, while ESD tends to be unstable and PSD is more sensitive to design choices. The approach offers a principled, practical route to faster, few-step generative modeling, with a versatile design space and publicly available code.

Abstract

Flow-based generative models achieve state-of-the-art sample quality, but require the expensive solution of a differential equation at inference time. Flow map models, commonly known as consistency models, encompass many recent efforts to improve inference-time efficiency by learning the solution operator of this differential equation. Yet despite their promise, these models lack a unified description that clearly explains how to learn them efficiently in practice. Here, building on the methodology proposed in Boffi et al. (2024), we present a systematic algorithmic framework for directly learning the flow map associated with a flow or diffusion model. By exploiting a relationship between the velocity field underlying a continuous-time flow and the instantaneous rate of change of the flow map, we show how to convert any distillation scheme into a direct training algorithm via self-distillation, eliminating the need for pre-trained teachers. We introduce three algorithmic families based on different mathematical characterizations of the flow map: Eulerian, Lagrangian, and Progressive methods, which we show encompass and extend all known distillation and direct training schemes for consistency models. We find that the novel class of Lagrangian methods, which avoid both spatial derivatives and bootstrapping from small steps by design, achieves significantly more stable training and higher performance than more standard Eulerian and Progressive schemes. Our methodology unifies existing training schemes under a single common framework and reveals new design principles for accelerated generative modeling. Associated code is available at https://github.com/nmboffi/flow-maps.

Paper Structure

This paper contains 58 sections, 12 theorems, 103 equations, 6 figures, 2 tables, 4 algorithms.

Key Result

Lemma 2.0

Let $X_{s, t}$ denote the flow map. Then $\lim_{s \to t} \partial_t X_{s, t}(x) = b_t(x)$ for all $t \in [0, 1]$, i.e. the tangent vectors to the curve $(X_{s, t}(x))_{t \in [s, 1]}$ give the velocity field $b_t(x)$ for every $x$.
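
A quick numerical check can make the tangent condition concrete. The sketch below is not taken from the paper: it assumes a toy linear ODE $\dot{x} = A x$ with a hypothetical coefficient `A`, for which the flow map is known in closed form, and measures how far the time derivative of $X_{s,t}(x)$ is from the velocity $b_t(x)$ as $s$ approaches $t$. All function names are illustrative.

```python
import math

A = -0.7  # hypothetical drift coefficient of the toy ODE dx/dt = A * x


def b(t, x):
    """Velocity field of the toy ODE (time-independent in this example)."""
    return A * x


def flow_map(s, t, x):
    """Exact two-time flow map of the toy ODE: X_{s,t}(x) = x * exp(A * (t - s))."""
    return x * math.exp(A * (t - s))


def dt_flow_map(s, t, x, h=1e-6):
    """Central finite-difference estimate of the derivative of X_{s,t}(x) in t."""
    return (flow_map(s, t + h, x) - flow_map(s, t - h, x)) / (2 * h)


def tangent_gap(t, x, eps):
    """|d/dt X_{s,t}(x) - b_t(x)| evaluated at s = t - eps."""
    return abs(dt_flow_map(t - eps, t, x) - b(t, x))


# The gap should shrink roughly linearly as s approaches t.
gaps = [tangent_gap(0.5, 1.3, eps) for eps in (1e-1, 1e-2, 1e-3)]
```

For this toy problem the gap equals $|A x|\,|e^{A \epsilon} - 1|$ up to finite-difference error, so it vanishes as $\epsilon \to 0$, which is what the lemma asserts in general.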

Figures (6)

  • Figure 1: Overview. (A) Schematic of the two-time flow map $X_{s,t}$ and the tangent condition (Lemma 2.0), which relates the map to the drift of the probability flow. The flow map is composable and invertible, and as $t\rightarrow s$ its time derivative recovers the drift $b_s$ of the probability flow ODE. (B) Illustration of our proposed parameterization. The function $v_{s, t}$ estimates the slope of the line drawn between two points on a trajectory of the probability flow, and can be trained directly and efficiently via the tangent condition.
  • Figure 2: Self-distillation. Our plug-and-play approach pairs any distillation objective $\mathcal{L}_{\mathsf{D}}$ on the off-diagonal $s \neq t$ of the square $[0, 1]^2$ with a flow matching objective $\mathcal{L}_b$ on the diagonal $s=t$ to obtain a direct training algorithm for the flow map.
  • Figure 3: Checker dataset. Qualitative results for the two-dimensional checker dataset. LSD performs best at every step count except $N=16$ (see the benchmarks table). All methods improve as the number of steps increases. ESD and both PSD variants fail to capture the sharp boundaries at small $N$, introducing artifacts and driving $\mathsf{KL}$ higher.
  • Figure 4: CIFAR-10: Parameter gradient norms. Spatial and temporal representations in the flow map impact parameter gradient norms of self-distillation methods that require network time and space derivatives.
  • Figure 5: Progressive refinement. Sample quality as a function of sampling steps using the same eight fixed noise samples across all methods for fair comparison. (Top) CIFAR-10, (Middle) CelebA-64, (Bottom) AFHQ-64. LSD consistently produces coherent samples across all datasets and step counts.
  • ...and 1 more figure
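
The captions of Figures 1 and 2 can be illustrated with a small sketch. It assumes the slope parameterization $X_{s,t}(x) = x + (t - s)\, v_{s,t}(x)$ suggested by Figure 1(B), and it uses a toy linear ODE $\dot{x} = A x$ with a hypothetical coefficient `A` in place of a learned model. The "diagonal" residual here is a simple stand-in for the flow matching objective, and the "off-diagonal" residual is one plausible Lagrangian-style check in which the model's own diagonal $v_{t,t}$ plays the role of the teacher. Neither is claimed to be the paper's exact loss.

```python
import math

A = -0.7  # hypothetical coefficient of the toy ODE dx/dt = A * x


def v_exact(s, t, x):
    """Average slope (X_{s,t}(x) - x) / (t - s) for the toy ODE; A * x when s == t."""
    dt = t - s
    if abs(dt) < 1e-12:
        return A * x
    return x * math.expm1(A * dt) / dt


def v_wrong(s, t, x):
    """Correct on the diagonal s == t, but ignores trajectory curvature off it."""
    return A * x


def make_flow_map(v):
    """Assumed parameterization: X_{s,t}(x) = x + (t - s) * v_{s,t}(x)."""
    return lambda s, t, x: x + (t - s) * v(s, t, x)


def diagonal_residual(v, t, x):
    """Diagonal (s == t) check: v_{t,t}(x) should equal the drift b_t(x) = A * x."""
    return v(t, t, x) - A * x


def lagrangian_residual(v, s, t, x, h=1e-5):
    """Off-diagonal check: d/dt X_{s,t}(x) minus the model's own drift at X_{s,t}(x)."""
    X = make_flow_map(v)
    dX_dt = (X(s, t + h, x) - X(s, t - h, x)) / (2 * h)
    return dX_dt - v(t, t, X(s, t, x))


# (s, t, x) triples off the diagonal of the unit square.
points = [(0.0, 1.0, 1.3), (0.2, 0.7, -0.5)]
exact_loss = sum(lagrangian_residual(v_exact, s, t, x) ** 2 for s, t, x in points)
wrong_loss = sum(lagrangian_residual(v_wrong, s, t, x) ** 2 for s, t, x in points)
```

In this toy setting `v_wrong` passes the diagonal check but leaves a large off-diagonal residual, which suggests why a diagonal term and an off-diagonal term are paired: the diagonal pins down the drift, and the off-diagonal term propagates it to finite time gaps.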

Theorems & Definitions (19)

  • Lemma 2.0: Tangent condition
  • Proposition 2.0: Flow map
  • Proposition 2.0: Self-distillation
  • Proposition 2.0: Wasserstein bounds
  • Lemma A.0: Transport equation
  • proof
  • Proposition E.2
  • proof
  • Lemma E.2: Tangent condition
  • proof
  • ...and 9 more