One-Step Image Translation with Text-to-Image Models

Gaurav Parmar; Taesung Park; Srinivasa Narasimhan; Jun-Yan Zhu

TL;DR

This work presents CycleGAN-Turbo and pix2pix-Turbo, one-step image-to-image translation methods that adapt pre-trained diffusion backbones via adversarial learning to tasks with and without paired data. By feeding the conditioning input directly to the model and combining an end-to-end architecture, LoRA adapters, and skip connections, the approach preserves input structure while enabling fast inference and strong translation quality, often matching or surpassing GAN-based and diffusion-based baselines. Experiments on day-night, weather, and sketch/edge-to-image tasks show substantial speed advantages and competitive results, and ablations confirm the importance of the key design choices. The findings suggest that one-step diffusion models can serve as versatile backbones for a range of GAN objectives, enabling real-time, flexible image translation with a relatively small fine-tuning footprint.

Abstract

In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning. To tackle these issues, we introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. Specifically, we consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights, enhancing its ability to preserve the input image structure while reducing overfitting. We demonstrate that, for unpaired settings, our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks, such as day-to-night conversion and adding/removing weather effects like fog, snow, and rain. We extend our method to paired settings, where our model pix2pix-Turbo is on par with recent works like Control-Net for Sketch2Photo and Edge2Image, but with a single-step inference. This work suggests that single-step diffusion models can serve as strong backbones for a range of GAN learning objectives. Our code and models are available at https://github.com/GaParmar/img2img-turbo.
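The "small trainable weights" idea can be illustrated with a low-rank (LoRA-style) adapter. The following is a minimal NumPy sketch, not the authors' implementation: the layer sizes, rank, scaling factor `alpha`, and the function name `lora_forward` are all illustrative assumptions. It shows a frozen weight matrix plus a trainable low-rank update, which starts as a no-op because one of the two factors is initialized to zero.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha):
    """Linear layer with a low-rank update: y = x W^T + (alpha / r) * x A^T B^T.

    W is the frozen pre-trained weight; only A and B would be trained.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4  # hypothetical sizes, not taken from the paper

W = rng.standard_normal((d_out, d_in))        # frozen pre-trained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, rank))                   # trainable, zero init

x = rng.standard_normal((2, d_in))
y_base = x @ W.T                               # output of the frozen layer alone
y_lora = lora_forward(x, W, A, B, alpha=8.0)   # output with the adapter attached

frozen_params = W.size
trainable_params = A.size + B.size
print(f"frozen: {frozen_params}, trainable: {trainable_params}")
```

Because `B` starts at zero, the adapted layer initially reproduces the pre-trained layer exactly, and in this toy setting the trainable parameters are one eighth of the frozen ones.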

Paper Structure (16 sections, 5 equations, 18 figures, 6 tables)

Introduction
Related Work
Method
Adding Conditioning Input
Preserving Input Details
Unpaired Training
Extensions
Experiments
Comparison to Unpaired Methods
Ablation Study
Extensions
Discussion and Limitations
Additional Ablation Study
Additional Baseline Comparisons
Additional Analysis
...and 1 more section

Figures (18)

Figure 1: We present a general method for adapting a single-step diffusion model, such as SD-Turbo [sauer2023adversarial], to new tasks and domains through adversarial learning. This enables us to leverage the internal knowledge of pre-trained diffusion models while achieving efficient inference (e.g., 0.3 seconds for a 512x512 image). Our single-step image-to-image translation models, called CycleGAN-Turbo and pix2pix-Turbo, can synthesize realistic outputs for unpaired (top) and paired settings (bottom), respectively, on various tasks.
Figure 2: Our generator architecture. We tightly integrate three separate modules in the original latent diffusion models into a single end-to-end network with small trainable weights. This architecture allows us to translate the input image $x$ to the output $y$, while retaining the input scene structure. We use LoRA adapters [hu2021lora] in each module, introduce skip connections and Zero-Convs [zhang2023adding] between input and output, and retrain the first layer of the U-Net. Blue boxes indicate trainable layers. Semi-transparent layers are frozen. The same generator can be used for various GAN objectives.
Figure 3: (Left) The one-step model learns to map the input noise to the output image. Note that the features of SD2.1-Turbo form a coherent layout (a) from the noise map. (Right) Unfortunately, adding condition encoder branches [zhang2023adding, mou2023t2i] causes conflicts, since features (b) from the new branch represent a different layout compared to the original feature (a). This conflict deteriorates the downstream feature (c) in the SD-Turbo Decoder, affecting the output quality. The feature maps are visualized with PCA.
Figure 4: Skip connections help retain details. We visualize the outputs of our day-to-night models trained with and without skip connections. Adding skip connections preserves the details of the input daytime image. The zoomed-in crops of the night images are gamma-adjusted by 1.5 for easier visualization.
Figure 5: Comparison to baselines on 256 $\times$ 256 datasets. We compare our unpaired method to CUT [park2020contrastive] and Instruct-pix2pix [brooks2022instructpix2pix], the best-performing GAN-based and diffusion methods, respectively. CUT often outputs images with severe artifacts, whereas Instruct-pix2pix fails to preserve the input image structure.
...and 13 more figures
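
The skip connections with Zero-Convs mentioned in the Figure 2 caption can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's code: it assumes the zero-conv is a 1x1 convolution whose weight and bias start at zero, and the channel counts, spatial size, and the helper names `conv1x1` and `skip_with_zero_conv` are hypothetical. The point is that the skip path contributes nothing at initialization, so the pre-trained decoder features are untouched until training moves the weights away from zero.

```python
import numpy as np


def conv1x1(feat, weight, bias):
    """1x1 convolution mapping a (C_in, H, W) feature map to (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, feat) + bias[:, None, None]


def skip_with_zero_conv(decoder_feat, encoder_feat, weight, bias):
    """Add encoder features to decoder features through a 1x1 convolution."""
    return decoder_feat + conv1x1(encoder_feat, weight, bias)


rng = np.random.default_rng(0)
c_enc, c_dec, h, w = 8, 16, 4, 4  # hypothetical shapes for illustration

encoder_feat = rng.standard_normal((c_enc, h, w))
decoder_feat = rng.standard_normal((c_dec, h, w))

# Zero initialization: the skip path is a no-op before any training step.
weight = np.zeros((c_dec, c_enc))
bias = np.zeros(c_dec)
out_init = skip_with_zero_conv(decoder_feat, encoder_feat, weight, bias)

# Stand-in for a training update: once the weights are nonzero,
# encoder details start flowing into the decoder features.
weight_trained = weight + 0.1
out_trained = skip_with_zero_conv(decoder_feat, encoder_feat, weight_trained, bias)

print(np.array_equal(out_init, decoder_feat))
```

Starting from a no-op is a common way to attach new paths to a pre-trained network without disturbing its initial behavior.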
