ControlFill: Spatially Adjustable Image Inpainting from Prompt Learning

Boseong Jeon

TL;DR

ControlFill enables diffusion-based image inpainting that needs no text encoder at inference. It learns two concept prompts, one for removal and one for creation, and introduces spatially varying guidance so that multiple user intentions can be applied within a single inference. Built on Stable Diffusion with LoRA, it trains the UNet and the two embeddings in two stages to maintain controllability while reducing memory and compute demands. Empirical results show improved object removal quality and context-consistent content generation compared with SDXL Inpaint and LaMa, which suggests potential for on-device editing. The work also acknowledges that exact object classes cannot be specified and points to class-specific prompt learning as a future direction for finer control.

Abstract

In this report, I present an inpainting framework named ControlFill, which involves training two distinct prompts: one for generating plausible objects within a designated mask (creation) and another for filling the region by extending the background (removal). During the inference stage, these learned embeddings guide a diffusion network that operates without requiring heavy text encoders. By adjusting the relative significance of the two prompts and employing classifier-free guidance, users can control the intensity of removal or creation. Furthermore, I introduce a method to spatially vary the intensity of guidance by assigning different scales to individual pixels.
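The per-pixel guidance described in the abstract can be illustrated with a minimal NumPy sketch. It assumes the usual classifier-free guidance combination, `eps_neg + s * (eps_pos - eps_neg)`, with the scalar `s` replaced by a map over pixels. The function names, array shapes, and the rule for swapping positive and negative prompts per region are all hypothetical and chosen for illustration; the exact formulation used in the report may differ.

```python
import numpy as np


def spatial_cfg(eps_pos, eps_neg, scale_map):
    """Classifier-free guidance with a per-pixel scale (illustrative sketch).

    eps_pos, eps_neg: (C, H, W) noise predictions under the positive and
        negative prompt embeddings (hypothetical inputs).
    scale_map: (H, W) guidance scale for each pixel.
    """
    eps_pos = np.asarray(eps_pos, dtype=float)
    eps_neg = np.asarray(eps_neg, dtype=float)
    scale_map = np.asarray(scale_map, dtype=float)
    if eps_pos.shape != eps_neg.shape:
        raise ValueError("eps_pos and eps_neg must have the same shape")
    if scale_map.shape != eps_pos.shape[-2:]:
        raise ValueError("scale_map must have shape (H, W)")
    # Same form as scalar guidance; the scale is broadcast over channels.
    return eps_neg + scale_map[None, :, :] * (eps_pos - eps_neg)


def region_guidance(eps_creation, eps_removal, creation_mask, scale_map):
    """Assumed per-region rule: choose which concept is positive at each pixel.

    Where creation_mask is True, creation is the positive prompt and removal
    the negative one; elsewhere the roles are swapped.
    """
    eps_creation = np.asarray(eps_creation, dtype=float)
    eps_removal = np.asarray(eps_removal, dtype=float)
    mask = np.asarray(creation_mask, dtype=bool)
    if mask.shape != eps_creation.shape[-2:]:
        raise ValueError("creation_mask must have shape (H, W)")
    m = mask[None, :, :]
    eps_pos = np.where(m, eps_creation, eps_removal)
    eps_neg = np.where(m, eps_removal, eps_creation)
    return spatial_cfg(eps_pos, eps_neg, scale_map)
```

Under these assumptions, a scale of 1 at a pixel returns the positive prediction unchanged, and larger values push the result further away from the negative prediction at that pixel only.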

Paper Structure

This paper contains 20 sections, 5 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Training and inference phases of ControlFill. (a) The model and prompts are trained to learn the two concepts of removal and creation. (b) During inference, these concepts can be applied by adjusting positive and negative prompts, without requiring a text encoder.
  • Figure 2: ControlFill can apply two user intentions (creation/removal) to individual mask regions (A and B in this example) within a single inference, without text encoders. The tuples in the bottom descriptions denote the positive and negative intentions, respectively.
  • Figure 3: Images generated by inpainting with diffusion priors when different prompts are given. In this case, the user tries to remove the dog on the chair. Fixed prompts such as "empty" or "background" might not work in all cases and can leave unwanted objects on the chair. The best prompt in this case is one that describes the nearest unmasked region.
  • Figure 4: Object removal performance depending on the mask generation method, compared with Zhuang et al. (2023). The left shows results from the model trained with a random mask for the removal concept, while the right shows results with my proposed mask generation method, which strictly avoids foreground objects.
  • Figure 5: Validation images during training. The inpainting output is pasted into the masked region to observe harmonization. As training progresses, the inpainting performance improves in terms of color matching with the unmasked region. However, object removal performance can degrade.
  • ...and 2 more figures
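
The pasting step mentioned in the Figure 5 caption, where the inpainting output is placed into the masked region so that harmonization with the surroundings can be inspected, can be sketched as a simple mask blend. This is a minimal illustration and not the report's code: the function name, the (H, W, C) image layout, and the convention that a mask value of 1 marks the inpainted region are assumptions.

```python
import numpy as np


def paste_into_mask(original, inpainted, mask):
    """Keep original pixels outside the mask, inpainted pixels inside it.

    original, inpainted: (H, W, C) images with the same shape.
    mask: (H, W) array with values in [0, 1]; 1 marks the inpainted region.
    """
    original = np.asarray(original, dtype=float)
    inpainted = np.asarray(inpainted, dtype=float)
    mask = np.asarray(mask, dtype=float)
    if original.shape != inpainted.shape:
        raise ValueError("original and inpainted must have the same shape")
    if mask.shape != original.shape[:2]:
        raise ValueError("mask must have shape (H, W)")
    m = mask[:, :, None]  # broadcast the mask over the channel axis
    return m * inpainted + (1.0 - m) * original
```

Because the unmasked pixels are copied unchanged from the original, any color mismatch at the mask boundary is visible in the composite, which is the effect the validation images are meant to reveal.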