Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation

Xiang Gao, Zhengbo Xu, Junhan Zhao, Jiaying Liu

TL;DR

The paper addresses open-domain text-guided image-to-image (I2I) translation by unifying diverse translation tasks under a single diffusion-based framework. It introduces FCDiffusion, which uses a Frequency Filtering Module based on the Discrete Cosine Transform (DCT) to filter source latent features into distinct spectral bands; the filtered features serve as control signals for a pre-trained Latent Diffusion Model via a ControlNet-like FCNet. Four spectral bands map to different I2I correlations (mini-band for style, low-band for style plus structure, mid-band for layout, and high-band for contours), enabling style-guided content creation, image semantic manipulation, image scene translation, and image style translation, respectively. The approach is end-to-end trainable with multiple detachable frequency control branches, so the task is selected at inference time by switching branches, with favorable speed and competitive quality. It also suggests plug-and-play spectral control as a direction for future work.

Abstract

Recently, large-scale text-to-image (T2I) diffusion models have emerged as a powerful tool for image-to-image translation (I2I), allowing open-domain image translation via user-provided text prompts. This paper proposes frequency-controlled diffusion model (FCDiffusion), an end-to-end diffusion-based framework that contributes a novel solution to text-guided I2I from a frequency-domain perspective. At the heart of our framework is a feature-space frequency-domain filtering module based on Discrete Cosine Transform, which filters the latent features of the source image in the DCT domain, yielding filtered image features bearing different DCT spectral bands as different control signals to the pre-trained Latent Diffusion Model. We reveal that control signals of different DCT spectral bands bridge the source image and the T2I generated image in different correlations (e.g., style, structure, layout, contour, etc.), and thus enable versatile I2I applications emphasizing different I2I correlations, including style-guided content creation, image semantic manipulation, image scene translation, and image style translation. Different from related approaches, FCDiffusion establishes a unified text-guided I2I framework suitable for diverse image translation tasks simply by switching among different frequency control branches at inference time. The effectiveness and superiority of our method for text-guided I2I are demonstrated with extensive experiments both qualitatively and quantitatively. Our project is publicly available at: https://xianggao1102.github.io/FCDiffusion/.
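
The DCT-domain filtering described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the choice to select a band by the index sum u + v, and every threshold in BANDS are assumptions made for illustration, and the latent is assumed to have shape (channels, height, width), for example 4 x 64 x 64. Only the task-to-band pairing is taken from the text above.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C: C @ x transforms x, C.T inverts it."""
    k = np.arange(n)[:, None]          # frequency index (rows)
    i = np.arange(n)[None, :]          # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)         # DC row uses a different scale
    return c

def band_mask(h: int, w: int, lo: float, hi: float) -> np.ndarray:
    """Binary mask keeping DCT coefficients with lo <= u + v < hi (assumed rule)."""
    u = np.arange(h)[:, None]
    v = np.arange(w)[None, :]
    s = u + v
    return ((s >= lo) & (s < hi)).astype(np.float64)

def filter_band(latent: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Keep one DCT spectral band of a (C, H, W) latent and return to feature space."""
    _, h, w = latent.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    spectrum = ch @ latent @ cw.T                  # 2D DCT, broadcast over channels
    spectrum = spectrum * band_mask(h, w, lo, hi)  # zero out everything outside the band
    return ch.T @ spectrum @ cw                    # inverse 2D DCT

# Hypothetical band limits for a 64 x 64 latent; placeholders, not the paper's values.
BANDS = {
    "mini": (0, 10),
    "low": (0, 30),
    "mid": (20, 40),
    "high": (50, np.inf),
}

# Task-to-band pairing as stated in the TL;DR and Figure 1 caption.
TASK_TO_BAND = {
    "style_guided_content_creation": "mini",
    "image_semantic_manipulation": "low",
    "image_scene_translation": "mid",
    "image_style_translation": "high",
}

def control_signal(latent: np.ndarray, task: str) -> np.ndarray:
    """Select the band for a task and return the filtered latent used as control input."""
    lo, hi = BANDS[TASK_TO_BAND[task]]
    return filter_band(latent, lo, hi)
```

In this sketch, switching tasks at inference time amounts to calling control_signal with a different task name; in FCDiffusion each band additionally has its own trained control branch, which the sketch does not model.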

Paper Structure (15 sections, 7 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: Our FCDiffusion adapts Stable Diffusion to versatile text-guided I2I applications via different types of frequency control, e.g., style-guided content creation realized by mini-frequency control, image semantic manipulation realized by low-frequency control, image scene translation realized by mid-frequency control, and image style translation realized by high-frequency control. Better viewed with zoom-in.
  • Figure 2: Overall architecture of FCDiffusion, as well as details of important model components.
  • Figure 3: Example text-guided I2I results of our method. Our method suits diverse I2I application scenarios emphasizing different I2I correlations simply by switching to different modes of frequency control. The mini-frequency, low-frequency, mid-frequency, and high-frequency controls correlate the source image and the generated image in style, style and structure, layout, and contours, respectively, realizing style-guided content creation, image semantic manipulation, image scene translation, and image style translation. Better viewed with zoom-in.
  • Figure 4: Visual comparisons of our method with related text-guided image translation methods on different I2I tasks including image semantic manipulation (top two rows), style-guided content creation (middle two rows), and image style translation (bottom two rows). Results of our method for these three tasks are obtained by switching to the low-frequency, mini-frequency, and high-frequency control branch respectively. Better viewed with zoom-in.
  • Figure 5: With low-frequency control, our method is able to manipulate image semantics under different degrees of semantic discrepancy. As the semantic gap between the source image and the target text increases, the translated image can still conform to the text with the original image style and structure preserved. Better viewed with zoom-in.
  • ...and 4 more figures