Filter-Guided Diffusion for Controllable Image Generation

Zeqi Gu, Ethan Yang, Abe Davis

TL;DR

This paper introduces Filter-Guided Diffusion (FGD), a training-free black-box method that conditions diffusion-based image generation on the structure of a guide image by applying a fast, edge-preserving filter between diffusion iterations. Guidance is applied gradually to denoised estimates using a joint bilateral tensor, enabling continuous control over structure preservation and color adaptation while supporting non-deterministic sampling for diverse outputs. Through quantitative metrics (CLIP and DINO) and qualitative results, FGD demonstrates superior efficiency and translation quality across multiple datasets, with favorable comparisons to both black-box and white-box baselines and clear advantages in speed and memory usage. The approach offers localized edit capabilities via masking and establishes a practical, extensible baseline for graphically guided diffusion, with potential extensions to other filters and forms of guidance.

Abstract

Recent advances in diffusion-based generative models have shown incredible promise for zero-shot image-to-image translation and editing. Most of these approaches work by combining or replacing network-specific features used in the generation of new images with those taken from the inversion of some guide image. Methods of this type are considered the current state of the art among training-free approaches, but they have notable limitations: they tend to be costly in runtime and memory, and they often depend on deterministic sampling, which limits variation in generated results. We propose Filter-Guided Diffusion (FGD), an alternative approach that leverages fast filtering operations during the diffusion process to support finer control over the strength and frequencies of guidance and can work with non-deterministic samplers to produce greater variety. Because it is efficient, FGD can be sampled over multiple seeds and hyperparameters in less time than a single run of other SOTA methods, producing superior results on structural and semantic metrics. We conduct extensive quantitative and qualitative experiments to evaluate the performance of FGD in translation tasks and also demonstrate its potential in localized editing when used with masks. Project page: https://filterguideddiffusion.github.io/

Paper Structure

This paper contains 22 sections, 15 equations, 10 figures, and 1 table.

Figures (10)

  • Figure 1: We present Filter-Guided Diffusion (FGD), a training-free black-box method for conditioning image diffusion on the structure of example images. FGD works by adding a fast filtering step between each iteration of the diffusion process. Building on classic image processing theory, we can design this filtering step to preserve the structure of a given guide image. The grid on the far left shows the effects of classic bilateral filters with different parameters on an image. The middle grid shows the result of using the same filters for guidance with FGD (in this case, using an empty prompt). The spatial filter size, which controls the scale of blur in a traditional bilateral filter, determines the spatial scale at which the network may vary from the structure of the input image. The value filter size, which controls edge preservation in a traditional bilateral filter, determines how much the diffusion process should respect the edges of the guide. In addition to the filter's parameters, we can control the overall strength of guidance to determine how closely the result should resemble the input. The right shows two examples of image-to-image translation with varying guidance strength. Results become closer to the guide image as guidance strength increases to the right.
  • Figure 2: Guidance in $s_t$ vs. $x_t$. Here, we show guidance in $x_t$ (top row) vs. $s_t$ (bottom row), along with the corresponding decoded intermediate latents across time-steps. Since our guide image is closer in distribution to the predicted image $s_t$ than to $x_t$, performing filtering in $s_t$ is much less likely to push our diffusion process away from the variance schedule used during training. As a result, guidance in $x_t$ produces over-smoothing artifacts, while guidance in $s_t$ is able to add greater detail, such as the grill marks, while also preserving the structure (last column and inset).
  • Figure 3: Effects of $\delta$. Increasing the guidance strength parameter $\delta$ causes generated images to take on more of the guide image structure. The results here use a joint bilateral tensor (bottom left) derived from a cat guide image (top left) with $\sigma_{s}=2$, $\sigma_{v}=1$, and $t_{end}=15$.
  • Figure 4: Normalization for Color Distribution Shifts. We show three typical scenarios for normalization. In the first row, FGD without normalization closely preserves the overall color of the guide image, which is often desirable when the color of the guide image is a good match for the prompt, such as when translating between dogs and cats. In the second row, adding normalization improves the result by allowing for colors that better suit the prompt. In the third row, normalization is necessary to transform the color in order to satisfy the full prompt.
  • Figure 5: Masking for Localized Edits. With masking, our method can make precise local edits specified by a user. For each mask, green indicates where the image should be guided by FGD, blue indicates where to give the model freedom to generate anything without guidance, and red indicates where to strictly preserve the contents of the guide or other optional image. Notably, our masked guidance differs from in-painting, shown in the rightmost column, which does not adhere to the layout of the original cat.
  • ...and 5 more figures
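
The roles of the spatial filter size, the value filter size, and the guidance strength described in the captions of Figures 1 and 3 can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: it uses a brute-force joint bilateral filter rather than a fast one, the function names `joint_bilateral_filter` and `guided_estimate` are hypothetical, and the update rule `s_t + delta * (F(guide) - F(s_t))` is one plausible way to pull the denoised estimate $s_t$ toward the guide's filtered structure, assumed here rather than taken from the text above.

```python
import numpy as np

def joint_bilateral_filter(image, guide, sigma_s=2.0, sigma_v=1.0):
    """Smooth `image` (H, W, C) using edge-aware weights computed from `guide` (H, W, Cg).

    sigma_s: spatial filter size (scale of blur).
    sigma_v: value filter size (how strongly edges of the guide are respected).
    Brute-force version for clarity; not a fast implementation.
    """
    image = np.asarray(image, dtype=np.float64)
    guide = np.asarray(guide, dtype=np.float64)
    h, w = image.shape[:2]
    r = int(np.ceil(2 * sigma_s))  # window radius (an illustrative choice)
    pad = ((r, r), (r, r), (0, 0))
    pad_img = np.pad(image, pad, mode="edge")
    pad_gd = np.pad(guide, pad, mode="edge")
    acc = np.zeros_like(image)
    norm = np.zeros((h, w, 1))
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            ys = slice(r + dy, r + dy + h)
            xs = slice(r + dx, r + dx + w)
            # Spatial weight: falls off with pixel distance.
            spatial = np.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2))
            # Value weight: falls off with difference in the guide's values.
            diff2 = np.sum((pad_gd[ys, xs] - guide) ** 2, axis=-1, keepdims=True)
            weight = spatial * np.exp(-diff2 / (2 * sigma_v ** 2))
            acc += weight * pad_img[ys, xs]
            norm += weight
    return acc / norm  # norm >= 1 because the center pixel has weight 1

def guided_estimate(s_t, guide, delta, sigma_s=2.0, sigma_v=1.0):
    """Hypothetical guidance step (assumed form, see lead-in).

    Moves the filtered structure of the denoised estimate `s_t` toward the
    filtered structure of `guide`, scaled by the guidance strength `delta`.
    """
    target = joint_bilateral_filter(guide, guide, sigma_s, sigma_v)
    current = joint_bilateral_filter(s_t, guide, sigma_s, sigma_v)
    return np.asarray(s_t, dtype=np.float64) + delta * (target - current)
```

In this sketch, `delta = 0` leaves the estimate unchanged, while larger values pull it closer to the guide's structure, mirroring the trend described for $\delta$ in Figure 3. A small `sigma_v` keeps smoothing from crossing edges of the guide, and a large `sigma_s` widens the spatial scale over which values are averaged.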