CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models

Zheyuan Hu, Chieh-Hsin Lai, Yuki Mitsufuji, Stefano Ermon

TL;DR

Flow-map models enable fast generation by learning the long jumps of the PF-ODE, but training stability and cost remain barriers. Consistency Mid-Training (CMT) introduces a compact intermediate stage that derives a trajectory-aware initializer from a pre-trained teacher, improving stability and data efficiency for both Consistency Models and Mean Flow. Theoretical results show reduced gradient bias and favorable convergence, while extensive experiments report state-of-the-art two-step FIDs across CIFAR-10 and ImageNet scales with up to ~98% savings in data and GPU time. Overall, CMT is a principled, architecture-agnostic method that makes flow-map learning more practical and scalable for vision generation.

Abstract

Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation by learning the long jump of the ODE solution of diffusion models, yet training remains unstable, sensitive to hyperparameters, and costly. Initializing from a pre-trained diffusion model helps, but still requires converting infinitesimal steps into a long-jump map, leaving instability unresolved. We introduce mid-training, the first concept and practical method that inserts a lightweight intermediate stage between the (diffusion) pre-training and the final flow map training (i.e., post-training) for vision generation. Concretely, Consistency Mid-Training (CMT) is a compact and principled stage that trains a model to map points along a solver trajectory from a pre-trained model, starting from a prior sample, directly to the solver-generated clean sample. It yields a trajectory-consistent and stable initialization. This initializer outperforms random and diffusion-based baselines and enables fast, robust convergence without heuristics. Initializing post-training with CMT weights further simplifies flow map learning. Empirically, CMT achieves state-of-the-art two-step FIDs: 1.97 on CIFAR-10, 1.32 on ImageNet 64x64, and 1.84 on ImageNet 512x512, while using up to 98% less training data and GPU time compared to CMs. On ImageNet 256x256, CMT reaches 1-step FID 3.34 while cutting total training time by about 50% compared to MF from scratch (FID 3.43). This establishes CMT as a principled, efficient, and general framework for training flow map models.
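
The mid-training stage described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the 1-D Gaussian data distribution, the analytic teacher drift, the Euler solver, the time grid, and the per-time affine "student" are all assumptions chosen so the example runs with NumPy alone. It only shows the shape of the regression problem: every point on a teacher solver trajectory, started from a prior sample, is mapped to that trajectory's solver-generated clean endpoint.

```python
import numpy as np

# --- Assumed toy setup (not from the paper) ---------------------------------
MU, S = 2.0, 0.5                           # toy data distribution N(MU, S^2)
T_MAX, T_MIN, N_STEPS = 80.0, 0.002, 32    # hypothetical noise range and grid size


def teacher_velocity(x, t):
    """PF-ODE drift dx/dt for Gaussian data under x_t = x_0 + t * eps.

    Stands in for a pre-trained diffusion teacher; here it is known in closed form.
    """
    return t * (x - MU) / (S**2 + t**2)


def solver_trajectory(x_start, times):
    """Euler-integrate the teacher from times[0] down to times[-1]."""
    xs = [x_start]
    for t_cur, t_next in zip(times[:-1], times[1:]):
        xs.append(xs[-1] + (t_next - t_cur) * teacher_velocity(xs[-1], t_cur))
    return np.stack(xs)  # shape: (len(times), batch)


rng = np.random.default_rng(0)
times = np.geomspace(T_MAX, T_MIN, N_STEPS + 1)   # decreasing time grid
x_T = T_MAX * rng.standard_normal(4096)           # data-free prior sample
traj = solver_trajectory(x_T, times)
target = traj[-1]                                 # solver-generated clean sample

# --- Toy "student": one affine map per grid time, fitted by least squares ----
# A real flow map model would be a neural network f(x, t); the affine table is
# a stand-in that happens to be exact for this linear teacher.
params = []
for i in range(N_STEPS):
    design = np.stack([traj[i], np.ones_like(traj[i])], axis=1)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    params.append(coef)
params = np.array(params)                         # shape: (N_STEPS, 2)


def student(x, i):
    """Map a point at grid time times[i] directly to the clean endpoint."""
    a, b = params[i]
    return a * x + b


# Mid-training objective: regress every trajectory point onto the endpoint.
mid_training_loss = np.mean(
    [np.mean((student(traj[i], i) - target) ** 2) for i in range(N_STEPS)]
)
one_step = student(x_T, 0)                        # one-jump generation from the prior
print(f"mid-training loss: {mid_training_loss:.3e}")
```

In this toy setting the fitted student reproduces the solver endpoint from any point on the trajectory, which is the trajectory-consistency property the abstract attributes to the CMT initializer; the subsequent post-training stage (CM or MF) is not sketched here.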

Paper Structure

This paper contains 53 sections, 11 theorems, 165 equations, 3 figures, 9 tables, 1 algorithm.

Key Result

Theorem 3.1

If $p_{\mathrm{prior}}$ matches the diffused marginal $p_T$ [footnote: This holds for sufficiently large $T$ as in EDM, or with appropriate noise schedules as in flow matching. Empirically, $p_T$ (data-dependent) and $p_{\mathrm{prior}}$ (data-free) perform identically, so we adopt $p_{\mathrm{prior}}$ in all e…]

Figures (3)

  • Figure 1: FID vs. training time for vanilla ECD [geng2024ect] and CMT (ours) on ImageNet $512{\times}512$. With the proposed mid-training, our CMT w/ ECD (as the post-trained flow map) achieves a state-of-the-art two-step FID of 1.84 using only 400 H100 GPU hours (mid- and post-training combined). Under the same budget, vanilla ECD still produces unrecognizable images, and it requires 4643.99 hours even to reach a reasonable two-step FID of 3.38. Overall, CMT reduces the total training cost of flow map models by 91.4% while achieving state-of-the-art performance.
  • Figure 2: FID vs. training time for vanilla MF and CMT (ours) on ImageNet $256{\times}256$. We perform mid-training starting from a randomly initialized XL/2 model, where the XL/2-sized CMT learns to match the deterministic sampler of a weaker, smaller teacher, MF-B/4. The resulting mid-trained CMT-XL/2 weights are then used to initialize MF-XL/2 post-training. This initialization produces semantically meaningful samples early and drives significantly faster convergence. With CMT's pipeline, training reaches a lower FID in only half the GPU hours of MF trained from scratch. MF initialized from SiT also converges quickly, but requires more than 1520 hours of pre-training, which exceeds the cost of training MF itself.
  • Figure 3: Two-step generated images by CMT. Using the trained CMT (w/ ECD) on ImageNet $512{\times}512$, we achieve the best two-step FID of 1.84, at 93% lower cost than the previous sCD.
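
The 91.4% reduction quoted in the Figure 1 caption is consistent with the two GPU-hour budgets reported there. As a quick check, assuming "total training cost" means H100 GPU hours and comparing the 400-hour CMT budget against the 4643.99 hours vanilla ECD needs to reach FID 3.38:

$$
1 - \frac{400}{4643.99} \approx 1 - 0.0861 = 0.9139 \approx 91.4\%
$$

Note that the two runs end at different two-step FIDs (1.84 versus 3.38), so this ratio compares budgets rather than cost at equal quality.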

Theorems & Definitions (20)

  • Theorem 3.1
  • Theorem 5.1: Informal Bias Comparison
  • Proposition F.1: Oracle CM minimizer
  • proof
  • proof
  • Lemma F.1
  • proof
  • Lemma F.2
  • proof
  • Lemma F.3
  • ...and 10 more