Contrastive Flow Matching

George Stoica, Vivek Ramanujan, Xiang Fan, Ali Farhadi, Ranjay Krishna, Judy Hoffman

TL;DR

This work introduces Contrastive Flow Matching (DeltaFM), a plug-and-play augmentation to conditional diffusion via a contrastive loss that enforces cross-class flow uniqueness, addressing the tendency of conditional flow matching to produce overlapping trajectories. DeltaFM increases discriminability across conditions without extra data or forward passes, achieving up to 9x faster training and 5x fewer denoising steps, while boosting image quality (FID improvements up to ~8.9) on ImageNet-1k and CC3M-based text-to-image tasks. The approach is compatible with Representation Alignment (REPA) and can be combined with classifier-free guidance (CFG) to further enhance performance, with analytical insights linking DeltaFM to CFG and ablations illustrating robust gains across model scales and datasets. The results demonstrate that enforcing conditional flow distinctiveness can markedly improve generation fidelity and efficiency in diffusion-based models, suggesting broader applicability to other conditional generative tasks.

Abstract

Unconditional flow-matching trains diffusion models to transport samples from a source distribution to a target distribution by enforcing that the flows between sample pairs are unique. However, in conditional settings (e.g., class-conditioned models), this uniqueness is no longer guaranteed: flows from different conditions may overlap, leading to more ambiguous generations. We introduce Contrastive Flow Matching, an extension to the flow matching objective that explicitly enforces uniqueness across all conditional flows, enhancing condition separation. Our approach adds a contrastive objective that maximizes dissimilarities between predicted flows from arbitrary sample pairs. We validate Contrastive Flow Matching by conducting extensive experiments across varying model architectures on both class-conditioned (ImageNet-1k) and text-to-image (CC3M) benchmarks. Notably, we find that training models with Contrastive Flow Matching (1) improves training speed by a factor of up to 9x, (2) requires up to 5x fewer denoising steps, and (3) lowers FID by up to 8.9 compared to training the same models with flow matching. We release our code at: https://github.com/gstoica27/DeltaFM.git.
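
The contrastive objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the released implementation. It assumes a linear interpolation path from noise `x0` to data `x1` (so the target flow is `x1 - x0`), draws each sample's negative pair by shifting the batch by one position, and uses a hypothetical weight `lam`. The function name and all parameter names are illustrative only.

```python
import numpy as np

def contrastive_fm_loss(v_pred, x0, x1, lam=0.05):
    """Sketch of a flow-matching loss with a contrastive penalty.

    v_pred : (B, D) flows predicted by the model (one per sample)
    x0     : (B, D) source (noise) samples
    x1     : (B, D) target (data) samples
    lam    : hypothetical weight on the contrastive term (an assumption)
    Requires B >= 2 so that each sample has a distinct negative.
    """
    # Assumed target flow for the linear path x_t = (1 - t) * x0 + t * x1.
    target = x1 - x0
    # Negative targets: the flow of another sample pair in the batch.
    # Shifting by one guarantees no sample is paired with itself.
    neg_target = np.roll(target, shift=1, axis=0)

    # Standard flow-matching term: pull predictions toward their own flow.
    fm_term = np.mean(np.sum((v_pred - target) ** 2, axis=1))
    # Contrastive term: distance to the flow of an unrelated sample pair.
    contrast_term = np.mean(np.sum((v_pred - neg_target) ** 2, axis=1))

    # Minimizing this total keeps fm_term small while pushing
    # contrast_term up, i.e. making predicted flows dissimilar.
    return fm_term - lam * contrast_term, fm_term, contrast_term

# Usage on toy 2-D data: a perfect prediction has zero flow-matching error.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 2))
x1 = rng.normal(loc=3.0, size=(8, 2))
perfect = x1 - x0
loss, fm_term, contrast_term = contrastive_fm_loss(perfect, x0, x1)
```

Subtracting the contrastive term is what rewards predicted flows for differing from those of other sample pairs. Setting `lam=0` recovers the plain flow-matching term under the same assumptions.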

Paper Structure

This paper contains 35 sections, 11 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: $\Delta$FM yields more discriminative and higher quality trajectories. (Left) Standard flow-matching produces flows that are straight but overlap for similar class distributions. (Right) Adding the $\Delta$FM objective yields more distinct flows, producing images that are more representative of their respective classes.
  • Figure 2: Contrastive Flow-Matching intrinsically separates flows between classes. We train a small three-layer MLP flow-matching model to transport between a two-dimensional multivariate noise distribution (violet) and two independent class distributions (blue and orange). The class distributions are designed to have $\sim 50\%$ overlap, and we plot the learned class-conditioned flows between noise samples and each respective class distribution using class colors. Top: Flow-matching models learn overlapping transports between distributions, generating outputs that lie in ambiguous regions between the two classes. Bottom: Contrastive flow-matching models have significantly more discriminative flows, generating class-coherent samples while reducing ambiguity.
  • Figure 3: Contrastive flow-matching ($\Delta$FM) denoises significantly more efficiently than flow-matching. We visualize the expected final image estimated by a flow model every 5 steps along 30-step trajectories, using the SDE Euler-Maruyama sampler without classifier guidance. We compare the trajectories of a REPA SiT-XL/2 (Yu et al., 2024) trained on ImageNet-256 for 400K steps with flow-matching (FM) against the same model trained with the contrastive flow-matching ($\Delta$FM) objective. Trajectories are shown in pairs generated from the same noise sample during inference, with the flow-matching model above our $\Delta$FM version.
  • Figure 4: $\Delta$FM requires significantly fewer training iterations and inference-time denoising steps. We plot FID-50k on ImageNet 256x256 for different numbers of training iterations and denoising steps. $\Delta$FM outperforms the baseline with 9$\times$ fewer training iterations and a 5$\times$ reduction in the number of inference-time denoising steps, indicating that $\Delta$FM is more efficient in both training and inference.
  • Figure 5: