FilterPrompt: A Simple yet Efficient Approach to Guide Image Appearance Transfer in Diffusion Models

Xi Wang, Yichen Peng, Heng Fang, Yilin Wang, Haoran Xie, Xi Yang, Chuntao Li

TL;DR

FilterPrompt introduces a pixel-space filtering strategy to guide appearance transfer in diffusion models by jointly leveraging ControlNet for structure and IP-Adapter for appearance. By applying targeted filters to feature distributions in the input, it decouples appearance from structure and biases the diffusion sampling process through filtered conditioning, achieving more precise and controllable results. Extensive quantitative and qualitative experiments across multiple domains demonstrate improved structure preservation, reduced content conflicts, and better texture/color fidelity, with user studies supporting practical usefulness. The approach is lightweight, model-agnostic, and interpretable, though identity consistency under rich semantic prompts remains a challenge for future work.

Abstract

In controllable generation tasks, flexibly manipulating generated images to attain a desired appearance or structure from a single input image cue remains a critical and longstanding challenge. Achieving this requires effectively decoupling the key attributes of the input image so that each can be represented accurately. Previous works have concentrated predominantly on disentangling image attributes within feature space. However, the complex distributions present in real-world data often make it difficult to apply such decoupling algorithms to other datasets. Moreover, the granularity of control over feature encoding frequently fails to meet specific task requirements. Upon scrutinizing the characteristics of various generative models, we have observed that the input sensitivity and dynamic evolution properties of the diffusion model can be effectively combined with explicit decomposition operations in pixel space. This allows the operations that we design and apply in pixel space to achieve the desired control over specific representations in the generated results. Therefore, we propose FilterPrompt, an approach to enhance the effect of controllable generation. It can be universally applied to any diffusion model, allowing users to adjust the representation of specific image features according to task requirements, thereby facilitating more precise and controllable generation outcomes. In particular, our experiments demonstrate that FilterPrompt optimizes feature correlation, mitigates content conflicts during the generation process, and enhances the effect of controllable generation.
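
To make the idea of an explicit pixel-space operation concrete, the following is a minimal sketch of one such filter: a separable Gaussian low-pass applied to an appearance image before it is passed to any conditioning branch. It is an illustration under stated assumptions, not the authors' implementation; the function names (`gaussian_kernel_1d`, `gaussian_filter_image`), the choice of `sigma`, and the reflect padding are all hypothetical choices made here.

```python
from typing import Optional

import numpy as np


def gaussian_kernel_1d(sigma: float, radius: Optional[int] = None) -> np.ndarray:
    """Return a normalized 1-D Gaussian kernel (assumed truncation at ~3 sigma)."""
    if radius is None:
        radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_filter_image(img: np.ndarray, sigma: float) -> np.ndarray:
    """Blur an (H, W, C) image with a separable Gaussian, using reflect padding.

    Hypothetical helper: it suppresses high-frequency texture in pixel space
    while keeping the coarse color distribution of the image.
    """
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    out = img.astype(np.float64)
    for axis in (0, 1):  # filter rows, then columns
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="reflect")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out
```

In this sketch the filtered image would stand in for the raw appearance image as the conditioning input; a larger `sigma` removes more fine texture, which is one way to trade texture detail against color-level appearance.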

Paper Structure (16 sections, 6 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: The comparison of generated results. Our approach FilterPrompt enables appearance transfer in multiple domains at local, object-centric, and full-graph levels. Compared to previous works like Cross-Image Attention, IP-Adapter, and the baseline (IP-Adapter + ControlNet), our approach can help the model better preserve the geometric properties of structural images while maintaining consistent color distribution and texture features with appearance images.
  • Figure 2: Our FilterPrompt. When diffusion models extract image features, strategically incorporating filtering operations enables targeted suppression or enhancement of particular feature distributions. The filters enhance the performance of diffusion models to improve the quality of generated images.
  • Figure 3: Filter impact on sampling inference process. After applying a Gaussian filter, the underlying texture in the sampled images changes from a distribution resembling arc patterns to a point-like distribution. Additionally, as shown in the enlarged illustration on the right, it is evident that the use of filters consistently disrupts the expression of redundant pattern features.
  • Figure 4: (a) illustrates the filter's impact on the corresponding sampling inference stages in the static generative model and the dynamic generative model. (b) compares the results obtained by applying a filter to several generative models. The gray background marks traditional work based on GAN and AE architectures (FastStylePredict, CycleGAN, DualStyleGAN). The yellow background marks work based on diffusion (InST, IP-Adapter). Comparing the results, we can see that the filter operation has a more significant impact on diffusion models.
  • Figure 5: Our framework. The experiment uses ControlNet and IP-Adapter as the baseline and adds combined filtering operations as the extension. We map low-level features in appearance images to global embeddings $C_s$ and concatenate them with the SDM default text prompt embeddings $C_t$. The denoising generation handles these parts separately. One part is managed by ControlNet, which projects latent distributions into a fused distribution controlled by high-level features $C_c$. The other part uses IP-Adapter for decoding and guiding low-level feature generation. The intermediate hidden states $x_{t-1}$ from both processes are weighted and summed at every sampling step.
  • ...and 4 more figures
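
The Figure 5 caption describes the hidden states $x_{t-1}$ from the structure branch and the appearance branch being weighted and summed at each sampling step. A minimal sketch of that blending is shown below. It assumes a convex combination with a single scalar weight, which the caption does not specify; `blend_hidden_states`, `toy_sampling_loop`, `w_struct`, and the two step callables are hypothetical names standing in for the real denoisers.

```python
from typing import Callable

import numpy as np

# A hypothetical "denoising step": takes the current latent and the step index.
StepFn = Callable[[np.ndarray, int], np.ndarray]


def blend_hidden_states(
    x_struct: np.ndarray, x_app: np.ndarray, w_struct: float = 0.5
) -> np.ndarray:
    """Weighted sum of two branch outputs (assumed convex combination)."""
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must lie in [0, 1]")
    return w_struct * x_struct + (1.0 - w_struct) * x_app


def toy_sampling_loop(
    x_T: np.ndarray,
    step_struct: StepFn,
    step_app: StepFn,
    num_steps: int,
    w_struct: float = 0.5,
) -> np.ndarray:
    """Run both branches at every step and blend their x_{t-1} predictions."""
    x = x_T
    for t in range(num_steps, 0, -1):
        x = blend_hidden_states(step_struct(x, t), step_app(x, t), w_struct)
    return x
```

With placeholder branches that return constant latents, the loop reduces to the weighted average of those constants, which makes the role of `w_struct` easy to inspect before real models are plugged in.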