TReFT: Taming Rectified Flow Models For One-Step Image Translation

Shengqian Li, Ming Gao, Yi Liu, Zuzeng Lin, Feng Wang, Feng Dai

TL;DR

This paper tackles the multi-step denoising bottleneck that keeps Rectified Flow (RF) models from real-time image translation. It introduces TReFT, a simple yet effective finetuning strategy that takes the velocity predicted by a pretrained DiT/UNet at the final denoising timestep and uses it directly as the translated output, enabling one-step translation. The authors provide theoretical backing (Theorem 1 and Theorem 2) showing that the RF velocity converges to the final clean latent as denoising nears completion, which justifies the one-shot output. With latent cycle-consistency and identity losses and lightweight architectural simplifications, TReFT achieves competitive or state-of-the-art results on multiple unpaired and paired translation benchmarks while keeping inference fast enough for real-time use with pretrained RF models.

Abstract

Rectified Flow (RF) models have advanced high-quality image and video synthesis via optimal transport theory. However, when applied to image-to-image translation, they still depend on costly multi-step denoising, hindering real-time applications. Although the recent adversarial training paradigm, CycleGAN-Turbo, works in pretrained diffusion models for one-step image translation, we find that directly applying it to RF models leads to severe convergence issues. In this paper, we analyze these challenges and propose TReFT, a novel method to Tame Rectified Flow models for one-step image Translation. Unlike previous works, TReFT directly uses the velocity predicted by pretrained DiT or UNet as output, a simple yet effective design that tackles the convergence issues under adversarial training with one-step inference. This design is mainly motivated by a novel observation that, near the end of the denoising process, the velocity predicted by pretrained RF models converges to the vector from the origin to the final clean image, a property we further justify through theoretical analysis. When applying TReFT to large pretrained RF models such as SD3.5 and FLUX, we introduce memory-efficient latent cycle-consistency and identity losses during training, as well as lightweight architectural simplifications for faster inference. Pretrained RF models finetuned with TReFT achieve performance comparable to state-of-the-art methods across multiple image translation datasets while enabling real-time inference.
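
The observation that the velocity approaches the clean image near the end of denoising can be checked by hand in a toy setting. The sketch below is not the paper's proof or experiment: it assumes Gaussian data $z_1 \sim N(\mu, s^2 I)$, Gaussian noise $z_0 \sim N(0, I)$, the linear path $z_t = t z_1 + (1-t) z_0$ (with $t=0$ as pure noise and $t=1$ as the clean sample, the convention used in the note to Figure 2), and the ideal velocity $E[z_1 - z_0 \mid z_t]$, which has a closed form in this case. The names `optimal_velocity`, `mu`, and `s` are hypothetical and chosen only for illustration.

```python
import numpy as np


def optimal_velocity(z, t, mu, s):
    """Ideal RF velocity E[z1 - z0 | z_t = z] for the toy Gaussian setup.

    Assumes z1 ~ N(mu, s^2 I), z0 ~ N(0, I), z_t = t*z1 + (1-t)*z0.
    """
    var_t = (t * s) ** 2 + (1.0 - t) ** 2      # per-coordinate variance of z_t
    resid = z - t * mu                          # z_t minus its mean
    e_z1 = mu + (t * s**2 / var_t) * resid      # E[z1 | z_t]
    e_z0 = ((1.0 - t) / var_t) * resid          # E[z0 | z_t]
    return e_z1 - e_z0


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
dim, s = 16, 0.5
mu = 2.0 * rng.normal(size=dim)       # toy data mean
z1 = mu + s * rng.normal(size=dim)    # one "clean" sample
z0 = rng.normal(size=dim)             # one noise sample

ts = [0.0, 0.5, 0.9, 1.0]
sims = []
for t in ts:
    z_t = t * z1 + (1.0 - t) * z0
    v = optimal_velocity(z_t, t, mu, s)
    sims.append(cosine(v, z1))
    print(f"t={t:.1f}  cos(v, z1)={sims[-1]:.4f}")

# At t = 1 the noise term has zero conditional mean, so v equals z1 exactly.
v_final = optimal_velocity(z1, 1.0, mu, s)
```

In this toy case the conditional mean of the noise vanishes at $t=1$, so the velocity evaluated at a clean latent is the latent itself. That is the property TReFT relies on when it treats the predicted velocity as the output; whether a pretrained network matches this ideal velocity is an empirical question that the paper examines separately.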

Paper Structure

This paper contains 29 sections, 75 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: FID score calculated during training on the Horse2zebra dataset. The experiment names indicate the pretrained models used: SD2.1 [rombach2022high], PixArt [chen2023pixart], FLUX [FLUX_website], and SD-Turbo [sauer2024adversarial]. The suffixes “Vanilla” and “TReFT” denote the applied finetuning strategies, while the prefix “PerFlow” means the model is first finetuned using PerFlow [yan2024perflow]. Please zoom in for details. See Appendix Sec. \ref{sec:sup_exp_fig_1} for detailed experimental implementation.
  • Figure 2: Comparison Between TReFT and Previous Paradigms. (a) Diffusion models using the Vanilla method (e.g., CycleGAN-Turbo [img2img-turbo]) take $z_1^a$ and timestep $t=0$ as input and output the one-step denoised image $\hat{z}_1^b$. (b) RF models using the Vanilla method. (c) RF models with TReFT take $z_1^a$ and timestep $t=1$ as input, and directly treat the prediction $v$ as the output $\hat{z}_1^b$. Happy: Easy to converge. Sad: Difficult to converge. Note: For simplicity, timesteps are unified. Here, $t=0$ is the state of pure noise, while $t=1$ corresponds to the clean image without noise.
  • Figure 3: Pathways to Predict $\hat{z}_1^b$ in Latent Space for Three Methods: Vanilla, Inversion, and TReFT (Ours). The four line types represent: pretrained flow (blue) from the pretrained RF model, initial flow (red) roughly aligned with its tangent direction, target flow (yellow) during training, and flow transition (grey) from initial to target. The three elliptical regions denote: noise distribution $N(0,I)$ (light red), source domain $p(a)$ (light blue), and target domain $p(b)$ (violet). The three methods illustrated are: (a) Vanilla: one-step denoising using the standard rectified flow scheduler. (b) Inversion: one-step inversion followed by one-step denoising. (c) TReFT (ours): directly applies $v_{\theta}(z_1^a, 1)$ for translation. Note: To visualize $\hat{z}_1^b$, $v_{\theta}(z_1^a, 1)$ is shifted to start at the origin. This illustration is based on Sec. \ref{sec:method-Preliminaries}, Sec. \ref{sec:treft} and Fig. \ref{fig:figure-10}.
  • Figure 4: The Cosine Similarity in VAE Latent Space. To evaluate the cosine similarity in VAE latent space, we conduct the experiment on 3,582 original–edited image pairs from the InstructPix2Pix CLIP-filtered Dataset [brooks2023instructpix2pix, Instructpix2pix_Clip_Filtered_dataset] using the VAE of SD3.5-Large [esser2024scaling], and visualize the results as histograms. The blue histograms (Vanilla) indicate that the pretrained and target flows are nearly orthogonal, whereas the red histograms (TReFT) reveal that the directions of the original and edited image latents are closely aligned.
  • Figure 5: The norm of $\hat{z}_0$ and cosine similarity between $v(z_t, t)$ and $z_1$ at different timesteps, with visualizations of $v(z_t, t)$. The image sequence at the top is generated directly by passing $v(z_t, t)$ through the VAE decoder. In the lower plot, the red curve corresponds to the left vertical axis and represents the norm of $\hat{z}_0$ at each timestep. The blue curve corresponds to the right vertical axis and shows the cosine similarity between $v(z_t, t)$ and $z_1$ at each timestep. This experiment is conducted on SD3.5-Large [esser2024scaling], sampling 50 steps to generate $1024 \times 1024 \times 3$ images for 1,000 different prompts.
  • ...and 12 more figures