Training-free Content Injection using h-space in Diffusion Models

Jaeseok Jeong; Mingi Kwon; Youngjung Uh

TL;DR

We address the challenge of controllability in pretrained diffusion models by proposing InjectFusion, a training-free method that injects content from a content image into a target image by blending ${\bm{h}}$-space bottleneck features and calibrating skip connections. The approach blends ${\bm{h}}$-space features with normalized spherical interpolation (Slerp) and introduces latent calibration to stabilize the reverse process, enabling realistic content transfer without fine-tuning or external networks. Quantitative (FID, ID, GRAM) and qualitative analyses demonstrate that InjectFusion preserves original image content while integrating content from the exemplar, outperforming related methods, particularly when color distributions differ significantly. The method expands the practical toolkit for diffusion-model-based editing, offering a lightweight, training-free path to content-aware image synthesis with broad applicability and local control via masks. Its limitations are the relatively low spatial resolution of ${\bm{h}}$-space and difficulty with out-of-domain content.

Abstract

Diffusion models (DMs) synthesize high-quality images in various domains. However, how to control their generative process remains unclear because the intermediate variables in the process have not been rigorously studied. Recently, the bottleneck feature of the U-Net, namely $h$-space, was found to convey the semantics of the resulting image, which enables StyleCLIP-like latent editing within DMs. In this paper, we explore further uses of $h$-space beyond attribute editing, and introduce a method to inject the content of one image into another image by combining their features in the generative processes. Briefly, given the original generative process of the other image, 1) we gradually blend the bottleneck feature of the content with proper normalization, and 2) we calibrate the skip connections to match the injected content. Unlike custom-diffusion approaches, our method does not require time-consuming optimization or fine-tuning. Instead, our method manipulates intermediate features within a feed-forward generative process. Furthermore, our method does not require supervision from external networks. The code is available at https://curryjung.github.io/InjectFusion/
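The blending step in 1) can be illustrated with a minimal sketch of norm-preserving spherical interpolation on flattened feature vectors. This is not the authors' implementation: the function name `slerp_normalized`, the choice to rescale the content feature to the norm of the original feature before interpolating, and the fallback to linear weights for nearly parallel vectors are all assumptions made for illustration.

```python
import math


def slerp_normalized(h_orig, h_content, gamma, eps=1e-12):
    """Blend two flattened feature vectors with Slerp.

    gamma is the content ratio: 0 keeps h_orig, 1 keeps the
    (rescaled) content feature. Hypothetical helper, not from the paper.
    """
    norm_o = math.sqrt(sum(x * x for x in h_orig))
    norm_c = math.sqrt(sum(x * x for x in h_content))
    if norm_o < eps or norm_c < eps:
        raise ValueError("feature vectors must be non-zero")

    # Assumption: match the content feature's norm to the original's,
    # so the blended feature keeps the original norm.
    scale = norm_o / norm_c
    content = [x * scale for x in h_content]

    # Angle between the two (now equal-norm) vectors.
    cos_theta = sum(a * b for a, b in zip(h_orig, content)) / (norm_o * norm_o)
    cos_theta = max(-1.0, min(1.0, cos_theta))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)

    if sin_theta < 1e-6:
        # Nearly parallel (or antiparallel): fall back to linear weights.
        w_o, w_c = 1.0 - gamma, gamma
    else:
        w_o = math.sin((1.0 - gamma) * theta) / sin_theta
        w_c = math.sin(gamma * theta) / sin_theta

    return [w_o * a + w_c * b for a, b in zip(h_orig, content)]


if __name__ == "__main__":
    h = [3.0, 0.0]
    c = [0.0, 8.0]
    blended = slerp_normalized(h, c, gamma=0.5)
    naive = [a + b for a, b in zip(h, c)]
    print("slerp norm:", math.hypot(*blended))
    print("naive sum norm:", math.hypot(*naive))
```

In this toy case the Slerp result keeps the norm of the original feature, whereas plain addition inflates it. That is the sense in which the blend is norm-preserving, under the rescaling assumption stated above.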

Paper Structure

This paper contains 42 sections, 16 equations, 37 figures, 3 tables, and 2 algorithms.

Introduction
Background
Diffusion models and controllability
Injecting contents from exemplar images
Style transfer
Denoising Diffusion Implicit Model (DDIM)
Asymmetric reverse process (Asyrp)
Method
Role of h-space
Preserving statistics with Slerp
Latent calibration
Full generative process
Experiments
Setting
Metrics
...and 27 more sections

Figures (37)

Figure 1: Overview of InjectFusion. During content injection, the bottleneck feature map is recursively injected throughout the sampling process, which starts from the inverted ${\bm{x}}_T$ of the images. The target content is reflected in the result images while the original images are preserved.
Figure 2: Illustration of content injection methods. (a) and (b) provide content injection but suffer from quality degradation. In contrast, (c) allows successful content injection by preserving statistics in DMs and gradually increasing the ratio of the target content.
Figure 3: Preliminary experiment. Naïve replacement of ${\bm{h}}$ does combine the content and the original image to some extent, but it severely degrades image quality.
Figure 4: Improvement in quality with Slerp. (a) shows the result of ${\bm{h}_t}+{\bm{h}_t^{content}}$, which has some artifacts. (b) shows that Slerp with $\gamma=0.5$ gives better quality. Techniques described later are not applied here, for a fair comparison.
Figure 5: Correlation between ${\bm{h}_t}$ and skip connection. ${\bm{h}_t}$ is highly correlated with the matching skip connection. (a) illustrates examples of matching and non-matching skip connections. (b) shows the correlation between each $\tilde{{\bm{h}}}_t$ and the skip connection. $r$ is the Pearson correlation coefficient, and the p-values of $r$ are less than 1e-15. Non-matching skip connections seriously distort the correlation.
...and 32 more figures
