CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models

Zheyuan Hu; Chieh-Hsin Lai; Yuki Mitsufuji; Stefano Ermon

TL;DR

Flow map models enable fast generation by learning long jumps along the probability-flow ODE (PF-ODE), but training instability and cost remain barriers. Consistency Mid-Training (CMT) introduces a compact intermediate stage that derives a trajectory-aware initializer from a pre-trained teacher, improving stability and data efficiency for both Consistency Models and Mean Flow. Theoretical results show reduced gradient bias and favorable convergence, and extensive experiments report state-of-the-art two-step FIDs on CIFAR-10 and on ImageNet at several resolutions, with savings of up to about 98% in training data and GPU time. Overall, CMT is a principled, architecture-agnostic method that makes flow map learning more practical and scalable for vision generation.

Abstract

Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation by learning long jumps along the ODE solution of diffusion models, yet their training remains unstable, sensitive to hyperparameters, and costly. Initializing from a pre-trained diffusion model helps, but the model must still convert infinitesimal steps into a long-jump map, which leaves the instability unresolved. We introduce mid-training, the first concept and practical method for vision generation that inserts a lightweight intermediate stage between (diffusion) pre-training and the final flow map training (i.e., post-training). Concretely, Consistency Mid-Training (CMT) is a compact and principled stage that trains a model to map points along a solver trajectory of a pre-trained model, starting from a prior sample, directly to the solver-generated clean sample. This yields a trajectory-consistent and stable initialization that outperforms random and diffusion-based baselines and enables fast, robust convergence without heuristics. Initializing post-training with CMT weights further simplifies flow map learning. Empirically, CMT achieves state-of-the-art two-step FIDs of 1.97 on CIFAR-10, 1.32 on ImageNet 64×64, and 1.84 on ImageNet 512×512, while using up to 98% less training data and GPU time than CMs. On ImageNet 256×256, CMT reaches a 1-step FID of 3.34 while cutting total training time by about 50% compared to MF trained from scratch (FID 3.43). This establishes CMT as a principled, efficient, and general framework for training flow map models.
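The mid-training objective described in the abstract (regress every point on a teacher solver trajectory onto that trajectory's clean endpoint) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the teacher is an analytic linear drift for Gaussian data, the solver is plain Euler on a geometric time grid, the loss is mean squared error, and the "model" is a per-time scalar standing in for a neural network. The names `teacher_velocity`, `solver_trajectory`, and `cmt_regression_loss` are hypothetical.

```python
import numpy as np

SIGMA_DATA = 0.5  # assumed std of a toy Gaussian data distribution


def teacher_velocity(x, t):
    """Toy stand-in for a pre-trained teacher (assumption, not the paper's model).

    For data ~ N(0, SIGMA_DATA^2 I) noised as x_t = x_0 + t * eps, the
    probability-flow ODE drift is linear in x.
    """
    return t / (SIGMA_DATA**2 + t**2) * x


def solver_trajectory(x_T, times):
    """Euler-integrate the teacher ODE along decreasing `times`.

    Returns every intermediate state, shape (len(times), batch, dim).
    """
    states = [x_T]
    x = x_T
    for t_cur, t_next in zip(times[:-1], times[1:]):
        x = x + (t_next - t_cur) * teacher_velocity(x, t_cur)
        states.append(x)
    return np.stack(states)


def cmt_regression_loss(model, states, times):
    """MSE between model(x_t, t) and the solver-generated endpoint,
    averaged over all points on the trajectory."""
    target = states[-1]
    per_time = [np.mean((model(x_t, t) - target) ** 2)
                for x_t, t in zip(states, times)]
    return float(np.mean(per_time))


rng = np.random.default_rng(0)
T, t_min, n_steps = 80.0, 0.002, 16           # assumed time grid
times = np.geomspace(T, t_min, n_steps)        # decreasing from T to t_min
x_T = T * rng.standard_normal((4, 2))          # prior samples
states = solver_trajectory(x_T, times)
target = states[-1]

# Baseline "model": returns its input unchanged.
identity_loss = cmt_regression_loss(lambda x, t: x, states, times)

# Toy "training": one least-squares scalar per time step. This fits the
# endpoint exactly only because the toy teacher is linear in x.
scales = {float(t): float(np.sum(x_t * target) / np.sum(x_t * x_t))
          for x_t, t in zip(states, times)}
fitted_loss = cmt_regression_loss(lambda x, t: scales[float(t)] * x,
                                  states, times)
```

In this toy setting the fitted per-time map sends every trajectory point to the same endpoint, which is the trajectory-consistency property the abstract attributes to the CMT initializer. A real setup would replace the scalar fit with gradient-based training of a network and the analytic drift with an actual pre-trained diffusion model.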
