ControlFill: Spatially Adjustable Image Inpainting from Prompt Learning
Boseong Jeon
TL;DR
ControlFill enables encoder-free, diffusion-based image inpainting by learning two concept prompts for removal and creation, and introduces spatially varying guidance to reflect multiple user intentions within a single inference. Built on Stable Diffusion with LoRA, it trains the UNet and two embeddings in two stages to maintain controllability while reducing memory and compute demands. Empirical results show improved object removal quality and context-consistent content generation compared with SDXL Inpaint and LaMa, indicating strong potential for on-device editing. The work also acknowledges limitations in specifying exact object classes and points to future directions in class-specific prompt learning for even finer control.
Abstract
In this report, I present \textit{ControlFill}, an inpainting framework that trains two distinct prompts: one for generating plausible objects within a designated mask (\textit{creation}) and another for filling the region by extending the background (\textit{removal}). At inference, these learned embeddings guide a diffusion network that runs without a heavy text encoder. By adjusting the relative weight of the two prompts under classifier-free guidance, users can control the intensity of removal or creation. Furthermore, I introduce a method to vary the guidance intensity spatially by assigning a different scale to each pixel.
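To make the per-pixel guidance idea concrete, the following is a minimal sketch, not the implementation used in this report. It assumes the standard classifier-free guidance combination, with the scalar scale replaced by a 2-D map that is broadcast over the channels. The function name `spatially_guided_noise`, the array shapes, and the choice of the creation prompt as the guided direction with the removal prompt as the reference are all illustrative assumptions; in practice the two noise predictions would come from the diffusion network conditioned on each learned embedding.

```python
import numpy as np


def spatially_guided_noise(eps_guided, eps_reference, scale_map):
    """Combine two noise predictions with a per-pixel guidance scale.

    Hypothetical helper: applies the usual classifier-free guidance rule
        eps = eps_reference + s * (eps_guided - eps_reference)
    where s is a (H, W) map instead of a single scalar.

    eps_guided, eps_reference: arrays of shape (C, H, W).
    scale_map: array of shape (H, W); larger values push harder
               toward the guided prompt at that pixel.
    """
    eps_guided = np.asarray(eps_guided, dtype=np.float32)
    eps_reference = np.asarray(eps_reference, dtype=np.float32)
    scale_map = np.asarray(scale_map, dtype=np.float32)

    if eps_guided.shape != eps_reference.shape:
        raise ValueError("noise predictions must have the same shape")
    if scale_map.shape != eps_guided.shape[-2:]:
        raise ValueError("scale_map must match the spatial size (H, W)")

    # (H, W) broadcasts against (C, H, W), so every channel at a given
    # pixel shares that pixel's guidance scale.
    return eps_reference + scale_map * (eps_guided - eps_reference)


# Toy stand-ins for the network outputs under each learned prompt
# (assumption: creation is the guided prompt, removal the reference).
rng = np.random.default_rng(0)
eps_creation = rng.standard_normal((4, 8, 8)).astype(np.float32)
eps_removal = rng.standard_normal((4, 8, 8)).astype(np.float32)

# Left half: scale 1 (plain creation prediction).
# Right half: scale 3 (stronger push toward creation).
scale_map = np.ones((8, 8), dtype=np.float32)
scale_map[:, 4:] = 3.0

guided = spatially_guided_noise(eps_creation, eps_removal, scale_map)
```

Because the scale is only a multiplicative map, several intensities can be applied in one denoising pass without extra network evaluations; a map built from user-drawn regions would be one way to encode multiple intentions at once.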
