One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting

Haipeng Liu, Yang Wang, Meng Wang

TL;DR

NTN-Diff tackles two challenges in text-guided image inpainting at once: preserving unmasked regions and achieving semantic consistency between masked and unmasked areas. It introduces a null-text-null frequency-aware diffusion framework that decouples semantics by frequency band and by diffusion stage: a three-branch early stage denoises the low- and mid-frequency bands under null-text and text prompts, and a late-stage refinement preserves the unmasked regions. Core ideas include adaptive low- and mid-frequency masking via DCT-based band separation, replacement of bands across branches, and a final text-guided pass that enforces cross-band semantic alignment. Experiments on BrushBench and EditBench show that NTN-Diff outperforms state-of-the-art diffusion models in both inpainting and outpainting; ablation studies validate the contribution of each denoising pathway and the importance of adaptive frequency extraction. Code is released for reproducibility.
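To make the DCT-based band separation and band replacement concrete, here is a minimal sketch on a single 2D array. It is not the released implementation: the function names `band_masks` and `replace_band`, and the fixed thresholds `low_cut` and `mid_cut`, are assumptions introduced for illustration, and the fixed thresholds stand in for the adaptive frequency extraction that the summary mentions.

```python
import numpy as np
from scipy.fft import dctn, idctn


def band_masks(shape, low_cut=0.1, mid_cut=0.4):
    """Boolean masks over 2D DCT coefficients for the low and mid bands.

    low_cut and mid_cut are hypothetical fixed thresholds on a normalized
    frequency radius; the paper describes an adaptive extraction instead.
    """
    h, w = shape
    u = np.arange(h)[:, None] / h   # vertical frequency index scaled to [0, 1)
    v = np.arange(w)[None, :] / w   # horizontal frequency index scaled to [0, 1)
    radius = np.sqrt(u**2 + v**2)   # small radius = top-left = low frequency
    low = radius < low_cut
    mid = (radius >= low_cut) & (radius < mid_cut)
    return low, mid


def replace_band(target, source, band):
    """Return target with the DCT coefficients selected by band taken from source."""
    t = dctn(target, norm="ortho")
    s = dctn(source, norm="ortho")
    t[band] = s[band]               # overwrite only the selected band
    return idctn(t, norm="ortho")
```

Calling `replace_band(a, b, low)` with the `low` mask from `band_masks(a.shape)` yields an array whose low-frequency coefficients come from `b` and whose remaining coefficients come from `a`. A multi-channel latent would need the transform applied per channel, which this sketch leaves out.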

Abstract

Text-guided image inpainting aims to reconstruct masked regions according to text prompts. The longstanding challenges are preserving the unmasked regions while achieving semantic consistency between the unmasked and inpainted masked regions. Prior work has failed to address both, typically remedying only one of them. As we observe, this stems from the entanglement of hybrid (e.g., mid and low) frequency bands, which encode different image properties and exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion model, dubbed NTN-Diff, for text-guided image inpainting. It decomposes the semantic consistency across masked and unmasked regions into per-frequency-band consistencies while preserving the unmasked regions, addressing both challenges together. Based on the diffusion process, we further divide denoising into an early (high-level noise) stage and a late (low-level noise) stage, in which the mid- and low-frequency bands are disentangled. As observed, the stable mid-frequency band is progressively denoised into semantic alignment during the text-guided denoising process; it meanwhile guides the null-text denoising process that denoises the low-frequency band for the masked regions. A subsequent text-guided denoising process at the late stage then achieves semantic consistency for the mid- and low-frequency bands across masked and unmasked regions while preserving the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over state-of-the-art diffusion models for text-guided image inpainting. Our code is available at https://github.com/htyjers/NTN-Diff.
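The staged procedure in the abstract can be sketched as control flow. The following is one plausible reading, not the authors' code: `toy_denoise` is a hypothetical stand-in for a real reverse-diffusion step, the names `swap_band` and `ntn_sketch` and the `steps` and `early` parameters are assumptions, and the known region is pasted back without re-noising it to the current step, which a real sampler would need to handle.

```python
import numpy as np
from scipy.fft import dctn, idctn


def swap_band(target, source, band):
    """Copy the DCT coefficients selected by band from source into target."""
    t, s = dctn(target, norm="ortho"), dctn(source, norm="ortho")
    t[band] = s[band]
    return idctn(t, norm="ortho")


def toy_denoise(x, t, prompt):
    """Hypothetical stand-in for one reverse-diffusion step; not a real model."""
    pull = 0.0 if prompt is None else 1.0   # pretend the text pulls values toward 1
    return 0.9 * x + 0.1 * pull


def ntn_sketch(x_T, known, mask, prompt, low, mid, steps=10, early=6):
    """mask is 1 inside the region to inpaint and 0 on pixels to keep."""
    x_null = x_text = x_out = x_T
    # Early stage (high noise): three branches, each with its own trajectory.
    for t in range(steps, steps - early, -1):
        # Null-text branch, unaffected by the prompt.
        x_null = toy_denoise(x_null, t, None)
        # Text-guided branch; its low band is replaced by the null-text one.
        x_text = swap_band(toy_denoise(x_text, t, prompt), x_null, low)
        # Second null-text branch; its mid band is replaced by the text-guided one.
        x_out = swap_band(toy_denoise(x_out, t, None), x_text, mid)
    # Late stage (low noise): text-guided steps with unmasked pixels restored.
    x = x_out
    for t in range(steps - early, 0, -1):
        x = toy_denoise(x, t, prompt)
        x = mask * x + (1 - mask) * known   # simplification: known is not re-noised
    return x
```

The design point the sketch tries to show is that text guidance never touches the low-frequency band in the early stage, while the mid-frequency band it does shape is handed to a prompt-free branch before the late-stage pass.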

Paper Structure

This paper contains 25 sections, 11 equations, 11 figures, 7 tables, 1 algorithm.

Figures (11)

  • Figure 1: (a-c) t-SNE visualization of CLIP latent space evolution during text-guided denoising. At each step, the Euclidean distance between the text prompt and the denoised image (Denoised Image-Text Distance) reflects semantic consistency across masked and unmasked regions; a smaller value indicates better alignment. As a reference, we include the distance between the ground truth and the text prompt (red dashed line). To assess unmasked region preservation, we also compute the distance between the denoised image and ground truth (Denoised Image-GT Distance). (d) Comparison between our NTN-Diff (Fig. 3) and the state of the art [avrahami2023blended, ju2024brushnet] for text-guided inpainting.
  • Figure 2: We investigate the text-guided denoising process for both the (a) low-frequency and (b) mid-frequency bands. For each step, red bounding boxes highlight the variations of the low-frequency band during the late stage in (a), and blue bounding boxes visualize the variations of layout information in (b). In the DCT spectrum, the top-left region represents low frequencies, while the bottom-right region corresponds to high frequencies. Dark red and yellow indicate the highest and lowest values, respectively.
  • Figure 3: Illustration of our proposed NTN-Diff pipeline, which comprises (i) a null-text denoising process (Sec. \ref{sec:null1}) that avoids being influenced by text prompts, and (ii) a text-guided denoising process (Sec. \ref{sec:text1}) that denoises the masked regions while replacing the low-frequency band of its denoised output with that from the above null-text denoising process. Building on this, we further use the denoised mid-frequency band to guide (iii) another null-text denoising process (Sec. \ref{sec:null2}) by substituting the mid-frequency band of this process. Additionally, (iv) a late-stage text-guided denoising process (Sec. \ref{sec:late}) is performed, along with the substitution of unmasked regions from the early stage of the diffusion process, to preserve unmasked regions at each step.
  • Figure 4: Illustration of (a) denoised low-frequency band layer and (b) mid-frequency band layer.
  • Figure 5: Comparison of text-guided inpainting results with the state of the art on BrushBench [ju2024brushnet] and EditBench [wang2023imagen]. NTN-Diff delivers superior inpainting results compared with the other methods.
  • ...and 6 more figures
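
The two distances named in the Figure 1 caption can be computed per denoising step once embeddings are available. The sketch below assumes all embeddings already live in one shared space (CLIP in the caption); producing them is outside its scope, the function name `distance_curves` is an assumption, and whether the figure measures distances before or after the t-SNE projection is not stated here.

```python
import numpy as np


def distance_curves(step_embeddings, text_embedding, gt_embedding):
    """Per-step Euclidean distances in a shared embedding space.

    step_embeddings: array of shape (num_steps, dim), one row per denoised image.
    text_embedding, gt_embedding: arrays of shape (dim,).
    """
    steps = np.asarray(step_embeddings, dtype=float)
    text = np.asarray(text_embedding, dtype=float)
    gt = np.asarray(gt_embedding, dtype=float)
    # Denoised Image-Text Distance: smaller means better semantic alignment.
    image_text = np.linalg.norm(steps - text, axis=1)
    # Denoised Image-GT Distance: smaller means better unmasked preservation.
    image_gt = np.linalg.norm(steps - gt, axis=1)
    # Ground truth to text prompt distance, the reference line in the figure.
    reference = float(np.linalg.norm(gt - text))
    return image_text, image_gt, reference
```

Plotting `image_text` and `image_gt` against the step index, with `reference` as a horizontal line, would give curves of the kind the caption describes.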