SONIC: Spectral Optimization of Noise for Inpainting with Consistency

Seungyeon Baek, Erqun Dong, Shadan Namazifard, Mark J. Matthews, Kwang Moo Yi

TL;DR

This work tackles image inpainting with off-the-shelf diffusion/flow models by optimizing the initial seed noise rather than modifying the model or performing extensive training. It introduces a linearized trajectory approximation to avoid back-propagating through the denoiser and advocates spectral-domain updates for stable convergence, paired with gradient-masking and latent-space fills to maintain the seed within a valid manifold. The approach achieves state-of-the-art results on FFHQ, DIV2K, and BrushBench across multiple perceptual and human-alignment metrics, demonstrating strong generalization to diverse masks without task-specific training. The method offers a practical, training-free pathway to high-quality inpainting with broad applicability to other inverse problems in the diffusion-model era.

Abstract

We propose a novel training-free method for inpainting with off-the-shelf text-to-image models. While guidance-based methods in theory allow generic models to be used for inverse problems such as inpainting, in practice their effectiveness is limited, which has made specialized inpainting-specific models necessary. In this work, we argue that the missing ingredient for training-free inpainting is the optimization (guidance) of the initial seed noise. We propose to optimize the initial seed noise to approximately match the unmasked parts of the data, using only a few tens of optimization steps. We then apply conventional training-free inpainting methods on top of our optimized initial seed noise. Critically, we propose two core ideas to implement this effectively: (i) to avoid the costly unrolling required to relate the initial noise and the generated outcome, we use a linear approximation; and (ii) to stabilize the optimization, we optimize the initial seed noise in the spectral domain. We demonstrate the effectiveness of our method on various inpainting tasks, outperforming the state of the art. Project page: https://ubc-vision.github.io/sonic/
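The two core ideas in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: `toy_denoise` is a hypothetical stand-in for a full sampler, the linearization is assumed to mean "treat the denoised output as the seed plus an offset held constant during the gradient step", and the plain gradient update and step size are illustrative choices.

```python
import numpy as np

def toy_denoise(x_T):
    # Hypothetical stand-in for a T-step sampler; NOT a real diffusion model.
    return np.tanh(x_T)

def optimize_seed_spectral(y, mask, steps=30, lr=0.5, seed=0):
    """Optimize a seed so its denoised output matches y where mask == 1.

    y    : (H, W) target values (known region only is used)
    mask : (H, W) array, 1 for known pixels, 0 for the hole
    """
    rng = np.random.default_rng(seed)
    x_T = rng.standard_normal(y.shape)
    X_T = np.fft.fft2(x_T, norm="ortho")  # spectral parameterization of the seed
    losses = []
    for _ in range(steps):
        x_T = np.fft.ifft2(X_T, norm="ortho").real
        # Assumed linearization: x0_hat = x_T + offset, with offset treated as a
        # constant, so d(x0_hat)/d(x_T) is the identity and no gradient flows
        # through toy_denoise.
        offset = toy_denoise(x_T) - x_T
        x0_hat = x_T + offset
        residual = mask * (x0_hat - y)            # masked error, zero in the hole
        losses.append(0.5 * float((residual ** 2).sum()))
        grad_x = residual                          # gradient of the loss w.r.t. x_T
        grad_X = np.fft.fft2(grad_x, norm="ortho") # same gradient, spectral coordinates
        X_T = X_T - lr * grad_X
    return np.fft.ifft2(X_T, norm="ortho").real, losses
```

Because the gradient is masked, the seed inside the hole keeps its original random values and only the known region is pulled toward the observation. With a plain gradient step and an orthonormal FFT, this update coincides with its spatial-domain counterpart; the text above does not state which optimizer or frequency weighting is used, so any practical difference between the two domains would come from choices not modeled in this sketch.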

Paper Structure

This paper contains 37 sections, 3 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Teaser -- We propose a novel training-free method of inpainting that focuses exclusively on the initial seed noise. (Top row) We show the denoising result of an initial seed noise, as we optimize the seed noise using our method. We optimize the seed noise to faithfully regenerate the non-masked regions of the input image, so as to obtain more consistent inpainting results. (Bottom row) Inpainting results of competing methods, with our final result on the right.
  • Figure 2: Initial seed noise already determines content -- We show example inpainting outcomes of BLD (Avrahami et al., 2023) with a Stable Diffusion 3.5 (medium) backbone using different initial seed noise. We show how different initial seeds denoise to different scene compositions even with the same prompt, and the final inpainting outcomes with each seed. Note how the content within the inpainted region, while modified, shares a similar structure with the denoised initial seed.
  • Figure 3: Optimizing in the spectral domain is important -- We show examples of how the optimized initial seed noise denoises during the optimization process, when optimized to match the non-masked regions in Figure 2, for seed A. While optimizing in the spatial domain also guides the result towards the desired scene composition, optimizing in the spectral domain provides much more stable and robust optimization.
  • Figure 4: Method overview -- We optimize the initial seed noise in the spectral domain, $\mathbf{X}_T$, starting from random noise $\mathbf{x}_T$, such that our denoised latent matches the masked observation $\mathbf{y}$ in the latent space. To allow partial observations to be encoded, we use nearest-pixel filling before passing the observation into the encoder. We then compute the masked mean squared error in the latent space against the fully denoised latent, and update $\mathbf{X}_T$ accordingly. Importantly, we linearize the entire $T$-step denoising process, essentially disconnecting the gradient flow passing through it. This allows us to optimize the initial seed noise $\mathbf{X}_T$ without back-propagating through the denoiser.
  • Figure 5: Optimizing in the spectral domain -- (Left) We show an example convergence graph of the loss in \ref{eq:spatial_loss} when optimizing in the spatial domain versus the spectral domain, for the same example as in Figure 3. (Right) We show the final inpainting outcomes for both domains. Optimizing in the spectral domain provides a seamless inpainting outcome, whereas optimizing in the spatial domain fails.
  • ...and 5 more figures
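
The nearest-pixel filling and masked mean squared error mentioned in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function names `nearest_pixel_fill` and `masked_mse` are hypothetical, the fill uses a brute-force Euclidean nearest-neighbor search suited only to small arrays, and the error is computed on whatever arrays are passed in rather than on latents from a real encoder.

```python
import numpy as np

def nearest_pixel_fill(image, known):
    """Copy into every unknown pixel the value of its nearest known pixel.

    image : (H, W) or (H, W, C) array
    known : (H, W) boolean array, True where the pixel is observed
    """
    known = known.astype(bool)
    ky, kx = np.nonzero(known)    # coordinates of observed pixels
    hy, hx = np.nonzero(~known)   # coordinates of pixels to fill
    filled = image.copy()
    if hy.size == 0:
        return filled
    if ky.size == 0:
        raise ValueError("at least one known pixel is required")
    # Squared distances between every hole pixel and every known pixel.
    d2 = (hy[:, None] - ky[None, :]) ** 2 + (hx[:, None] - kx[None, :]) ** 2
    nearest = d2.argmin(axis=1)   # ties resolve to the first known pixel in row-major order
    filled[hy, hx] = image[ky[nearest], kx[nearest]]
    return filled

def masked_mse(pred, target, known):
    """Mean squared error restricted to the observed region."""
    known = known.astype(bool)
    diff = (pred - target)[known]
    return float(np.mean(diff ** 2))
```

In a pipeline like the one the caption describes, the filled image would be passed to the encoder and the error evaluated with a mask at latent resolution; how that mask is resized is not specified here and is left out of the sketch.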