VeCoR - Velocity Contrastive Regularization for Flow Matching

Zong-Wei Hong, Jing-lun Li, Lin-Ze Li, Shen Zhang, Yao Tang

TL;DR

VeCoR introduces Velocity Contrastive Regularization to Flow Matching by adding a bidirectional attract–repel signal in velocity space. Negative velocity candidates, generated via augmentation-like perturbations across image, latent, and velocity domains, regularize trajectory evolution and reduce off-manifold drift. Empirically, VeCoR yields significant FID improvements and faster convergence on ImageNet-1K 256×256 and MS-COCO text-to-image tasks, with strong gains for lightweight models and low-function-evaluation budgets. The approach remains plug-and-play, data-efficient, and requires no architectural changes, offering a practical pathway to more stable and high-fidelity flow-based generation.

Abstract

Flow Matching (FM) has recently emerged as a principled and efficient alternative to diffusion models. Standard FM encourages the learned velocity field to follow a target direction; however, it may accumulate errors along the trajectory and drive samples off the data manifold, leading to perceptual degradation, especially in lightweight or low-step configurations. To enhance stability and generalization, we extend FM into a balanced attract-repel scheme that provides explicit guidance on both "where to go" and "where not to go." Formally, we propose Velocity Contrastive Regularization (VeCoR), a complementary training scheme for flow-based generative modeling that augments the standard FM objective with contrastive, two-sided supervision. VeCoR not only aligns the predicted velocity with a stable reference direction (positive supervision) but also pushes it away from inconsistent, off-manifold directions (negative supervision). This contrastive formulation transforms FM from a purely attractive, one-sided objective into a two-sided training signal, regularizing trajectory evolution and improving perceptual fidelity across datasets and backbones. On ImageNet-1K 256×256, VeCoR yields 22% and 35% relative FID reductions on SiT-XL/2 and REPA-SiT-XL/2 backbones, respectively, and achieves further FID gains (32% relative) on MS-COCO text-to-image generation, demonstrating consistent improvements in stability, convergence, and image quality, particularly in low-step and lightweight settings. Project page: https://p458732.github.io/VeCoR_Project_Page/
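
The attract-repel idea in the abstract can be illustrated with a minimal sketch. The exact VeCoR objective is not given in this text, so the form below (a squared-error attract term minus a weighted squared-error repel term) is an assumption, as are the names `mse` and `attract_repel_loss`, the linear-path target `x1 - x0`, and the additive-noise negative. The default weight 0.05 is borrowed from the value the Figure 5 caption reports as working best.

```python
import random

def mse(a, b):
    """Mean squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def attract_repel_loss(v_pred, v_pos, v_neg, lam=0.05):
    """Hypothetical two-sided loss: pull v_pred toward v_pos, push it from v_neg.

    The subtraction form is an assumption, not the paper's confirmed objective.
    """
    return mse(v_pred, v_pos) - lam * mse(v_pred, v_neg)

random.seed(0)
dim = 8
x0 = [random.gauss(0, 1) for _ in range(dim)]    # noise sample
x1 = [random.gauss(0, 1) for _ in range(dim)]    # stand-in for a data sample
v_pos = [b - a for a, b in zip(x0, x1)]          # linear-path target (assumed)
v_neg = [v + random.gauss(0, 1) for v in v_pos]  # hypothetical perturbed negative

# A prediction equal to the positive target scores lower than one equal
# to the negative candidate.
loss_at_target = attract_repel_loss(v_pos, v_pos, v_neg)
loss_at_negative = attract_repel_loss(v_neg, v_pos, v_neg)
```

In this assumed form the repel term is unbounded on its own, but for any weight below 1 the combined quadratic remains bounded below, so a small weight keeps the attract term dominant.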

Paper Structure

This paper contains 23 sections, 33 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Supervision and trajectory behavior. Left, Standard Flow Matching (SFM): trained only with positive supervision toward the ground-truth velocity (blue), the predicted trajectory (purple) may slightly deviate from the data manifold, sometimes leading to less stable generations. Right, VeCoR: by contrastively suppressing negative trajectories (red path and ×), VeCoR adds negative supervision that discourages off-manifold deviations and guides trajectories back toward the data manifold, improving stability and perceptual fidelity.
  • Figure 2: VeCoR refines strong SiT baselines by suppressing negative trajectories and improving stability and perceptual fidelity. Although SiT already produces plausible ImageNet-1K 256×256 samples, its sampling trajectories can still drift from the ground truth, causing color/contrast shifts, geometric distortions, blur, and artifacts; VeCoR reduces these issues under identical sampling (same seed, 50 NFEs, Euler-Maruyama). (a) Color/contrast: VeCoR yields a more saturated, uniform sky and wolf hues closer to the ground truth. (b) Geometric consistency: SiT bends the boat and distorts the lamp shade, while VeCoR produces a level hull and a lamp shade closer to the true shape. (c) Deblurring/sharpening: previously soft boundaries become crisp. (d) Artifact removal: SiT hallucinates extraneous structures (e.g., a mechanical arm near the spire; a protrusion above the bird’s beak), whereas VeCoR removes them, restoring clean, plausible shapes and textures.
  • Figure 3: Overview of the proposed Velocity Contrastive Regularization (VeCoR) framework. VeCoR enhances flow matching (FM) by introducing a balanced, bidirectional supervision mechanism in the velocity space. Instead of relying solely on positive guidance toward the ground-truth flow, VeCoR incorporates complementary contrastive cues that define counter-directional references across multiple representational domains. These perturbations, spanning (I) image, (II) latent, and (III) velocity spaces, are implemented through lightweight, augmentation-like transformations that preserve semantic consistency while altering dynamic behaviors. The resulting positive and negative velocities, $\hat{v}_{+}$ and $\hat{v}_{-}$, jointly guide the model-predicted velocity $v_\theta$ toward stable and coherent dynamics while discouraging drifts toward unstable regions. The visualization (bottom right) illustrates how negative velocity guidance can induce off-manifold deviations, leading to degraded sample quality.
  • Figure 4: Qualitative comparison between REPA and our REPA-based method (VeCoR) in terms of training convergence and denoising efficiency. We compare the images generated by two SiT-XL/2 + REPA models during the first 400K iterations, one of which integrates our method, VeCoR. Both models share the same noise, sampler, and number of sampling steps, and neither uses classifier-free guidance. The left panel shows results at different training iterations. While REPA demonstrates effectiveness in accelerating convergence, our VeCoR further improves the convergence speed. The right panel illustrates the denoising process, showing that our method not only enhances training convergence but also enables the model to predict more reliable velocity fields and reconstruct the data manifold more accurately under low-step settings.
  • Figure 5: Ablation on the regularization coefficient $\lambda$. A moderate value ($\lambda = 0.05$) yields the most natural and detailed images, while smaller or larger values cause artifacts or over-smoothed geometry.
  • ...and 2 more figures
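
The three perturbation domains named in the Figure 3 caption can be sketched as follows. Everything concrete here is a hypothetical stand-in: the caption does not specify the transformations, so the pair-averaging `encode` (in place of a real image encoder), the flip, the Gaussian latent noise, the sign reversal, and the linear-path velocity target are all assumptions chosen only to show where each perturbation enters the pipeline.

```python
import random

def encode(image):
    """Placeholder encoder: average adjacent pairs (stands in for a real encoder)."""
    return [(image[i] + image[i + 1]) / 2 for i in range(0, len(image), 2)]

def target_velocity(latent, noise):
    """Velocity from noise to latent along a linear path (assumed interpolant)."""
    return [z - e for z, e in zip(latent, noise)]

def negative_velocities(image, noise, rng, sigma=0.5):
    """Build one positive velocity and three hypothetical negative candidates."""
    latent = encode(image)
    v_pos = target_velocity(latent, noise)

    # (I) Image space: perturb the image (here, a flip), then encode it.
    v_image = target_velocity(encode(image[::-1]), noise)

    # (II) Latent space: perturb the encoded latent (here, Gaussian noise).
    noisy_latent = [z + sigma * rng.gauss(0, 1) for z in latent]
    v_latent = target_velocity(noisy_latent, noise)

    # (III) Velocity space: perturb the target velocity directly
    # (here, a sign reversal as one example of a counter-direction).
    v_velocity = [-v for v in v_pos]

    return v_pos, {"image": v_image, "latent": v_latent, "velocity": v_velocity}

rng = random.Random(0)
image = [rng.gauss(0, 1) for _ in range(8)]  # toy 1-D "image"
noise = [rng.gauss(0, 1) for _ in range(4)]  # noise matching the latent size
v_pos, negatives = negative_velocities(image, noise, rng)
```

Each candidate has the same shape as the positive velocity, so any of them could serve as the negative reference in a two-sided loss; which perturbations the authors actually use, and how strongly, is not stated in the caption.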