COT Flow: Learning Optimal-Transport Image Sampling and Editing by Contrastive Pairs

Xinrui Zu; Qian Tao

TL;DR

This paper introduces Contrastive Optimal Transport Flow (COT Flow), a method that unifies optimal transport with diffusion/flow-based generative models to achieve fast, high-quality sampling from arbitrary priors and enhanced zero-shot editing. Central to the approach are COT Pairs, a training scheme that leverages entropic OT trajectories and contrastive-like encodings, and the COT Editor, which enables flexible editing via dual-channel inputs and self-augmentation. The method addresses the generative learning trilemma by directly learning the transport flow between unpaired sources, enabling one-step sampling and competitive unpaired I2I translation quality, while offering new zero-shot editing capabilities such as COT composition and shape-texture coupling. Empirical results demonstrate strong performance on standard unpaired I2I tasks and showcase versatile editing scenarios, with ablations confirming the benefits of the proposed COT Pair design and training strategy. Overall, COT Flow provides a practical, OT-grounded pathway to fast and flexible generative modeling and editing.

Abstract

Diffusion models have demonstrated strong performance in sampling and editing multi-modal data with high generation quality, yet they suffer from an iterative generation process that is computationally expensive and slow. In addition, most methods are constrained to generate data from Gaussian noise, which limits their sampling and editing flexibility. To overcome both disadvantages, we present Contrastive Optimal Transport Flow (COT Flow), a new method that achieves fast and high-quality generation with improved zero-shot editing flexibility compared to previous diffusion models. Benefiting from optimal transport (OT), our method places no restriction on the prior distribution, enabling unpaired image-to-image (I2I) translation and doubling the editable space (at both the start and end of the trajectory) compared to other zero-shot editing methods. In terms of quality, COT Flow generates results in merely one step that are competitive with previous state-of-the-art unpaired I2I translation methods. To highlight the advantages that OT brings to COT Flow, we introduce the COT Editor to perform user-guided editing with excellent flexibility and quality. The code will be released at https://github.com/zuxinrui/cot_flow.

Paper Structure

This paper contains 28 sections, 1 theorem, 25 equations, 5 figures, 2 tables, and 2 algorithms.

Introduction
Background
Optimal Transport
Contrastive Learning
Consistency Models
Method
Similarities between Contrastive Learning and Consistency Models
COT Pairs
COT Training
COT Editor
Experiments
Unpaired Image-to-image Translation
COT Editor Scenarios
Ablation Studies
Conclusion
...and 13 more sections

Key Result

Proposition 3.1

Let $\pi^*$ be the OT plan between $\mu(\mathbf{x})$ and $\nu(\mathbf{y})$, and let the OT map $T^*$ recover $\pi^*$. The augmentation defined by Eq. \ref{eq:ot_trajectory} using $T^*$ samples the same probability distribution as the dynamic extension of the EOT plan $\pi^*_{\lambda}$ with $\lambda=2\sigma^2$.
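Eq. \ref{eq:ot_trajectory} is not reproduced on this page, so the following is only a minimal sketch under a stated assumption: that the augmentation takes the standard Brownian-bridge form commonly used for the dynamic extension of entropic OT, $\mathbf{x}_t \sim \mathcal{N}\big((1-t)\mathbf{x} + t\mathbf{y},\ \sigma^2 t(1-t)\mathbf{I}\big)$ with $\mathbf{y} = T^*(\mathbf{x})$. The function name `bridge_augment` and the toy data are hypothetical and are not taken from the paper or its code.

```python
import numpy as np


def bridge_augment(x, y, t, sigma, rng):
    """Sample x_t ~ N((1 - t) * x + t * y, sigma^2 * t * (1 - t) * I).

    Assumed Brownian-bridge form of the augmentation; under this form the
    matching entropic regularisation would be lam = 2 * sigma**2.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    mean = (1.0 - t) * x + t * y            # straight-line interpolation
    std = sigma * np.sqrt(t * (1.0 - t))    # vanishes at both endpoints
    return mean + std * rng.standard_normal(x.shape)


# Toy stand-in for a source batch x and its transported batch y = T(x).
rng = np.random.default_rng(0)
x = np.zeros((20000, 2))
y = np.ones((20000, 2))
x_mid = bridge_augment(x, y, 0.5, sigma=0.2, rng=rng)
```

At $t=0$ and $t=1$ the noise term vanishes, so the sketch returns the source and target samples exactly; in between, the spread is largest at $t=0.5$.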

Figures (5)

Figure 1: (a). Unpaired image-to-image translation by our proposed COT Flow, with one-step or multi-step sampling. (b). Our proposed COT Editor enables zero-shot image editing with high flexibility. COT composition (middle panel) allows users to composite elements and synthesize realistic images. Shape-texture coupling (right panel) allows users to separately draw or use shapes and textures as dual inputs, to generate fused images with high quality.
Figure 2: (a). The generative learning trilemma. Current generative methods still cannot simultaneously satisfy the three performance indicators: high quality, fast sampling, and mode coverage. (b). Recent developments of the diffusion/flow-based generative models, including iDDPM [Nichol2021], EDM [Karras2022], DDIM [Song2020], DPM [Lu2022], Progressive Distillation (PD) [Salimans2022], Consistency Distillation (CD) [Song2023], VP ODE [Song2020a], Flow Matching (FM) [Lipman2022], Conditional Flow Matching (CFM) [Tong2023], Rectified Flow (RF) [Liu2022], and Stable Diffusion v3 (SDv3) [Esser2024]. All methods implicitly approach the OT formulation, either by sampling straight trajectories or by avoiding crossings between trajectories through various techniques.
Figure 3: Left: The sampling strategy of our method. Given an input $\mathbf{x}$, we can generate the target data $\tilde{\mathbf{y}}$ with one-step sampling $\tilde{\mathbf{y}}=\mathcal{E}_{\theta}(\mathbf{x},1)$, or optionally multi-step sampling using Eq. \ref{eq:self-augmentation}/\ref{eq:sampling}, where the intermediates $\tilde{\mathbf{x}}_{t_k}$ are the augmentations between the source input $\mathbf{x}$ and the generated target $\tilde{\mathbf{y}}$. Right: Three scenarios of the proposed COT Editor, some of which have dual-channel inputs as extensions to the current editing methods. (a). COT composition. Given a target image $\mathbf{y}$ with an edited component or mask $\mathbf{m}$, we use the guidance $\mathbf{y}^{(g)}=\mathbf{y}\oplus\mathbf{m}$ as the single input and synthesize the output $\tilde{\mathbf{y}}$ by Eq. \ref{eq:cot_inpainting}. (b). Shape-texture coupling. With a drawn stroke image $\hat{\mathbf{x}}_1$ and a texture image $\hat{\mathbf{x}}_2$, the output $\tilde{\mathbf{y}}$ combines the features of both. (c). COT augmentation. Given a series of auto-detected cardiac-cycle edges $\{\hat{\mathbf{x}}^{(a)}\}$ and a single MRI $\mathbf{y}$, we can generate a cycle of cardiac MRI $\{\tilde{\mathbf{y}}\}$ with the same movements as $\{\hat{\mathbf{x}}^{(a)}\}$ and the style of $\mathbf{y}$.
Figure 4: Generation comparison between our method (bottom row) and SDEdit (middle row) on CelebA male$\to$female (64$\times$64), handbag$\to$shoes (64$\times$64), and outdoor$\to$church (128$\times$128). We use one-step sampling in our method and set $t=500$ in the reverse diffusion process of SDEdit to produce the results.
Figure 5: Zero-shot image editing comparison between our method (COT Editor) and SDEdit on CelebA male$\to$female (64$\times$64), handbag$\to$shoes (64$\times$64), and outdoor$\to$church (128$\times$128). We use one-step and multi-step sampling in our method and set $t=300,400,500,600$ in the reverse diffusion process of SDEdit to produce the editing results.

Theorems & Definitions (2)

Proposition 3.1: Eq. \ref{eq:ot_trajectory} estimates the dynamic extension of the OT plan
proof
