PainterNet: Adaptive Image Inpainting with Actual-Token Attention and Diverse Mask Control
Ruichen Wang, Junliang Zhang, Qingsong Xie, Chen Chen, Haonan Lu
TL;DR
PainterNet introduces a plug-and-play inpainting framework for diffusion models that leverages local textual prompts, Attention Control Points (ACP), and Actual-Token Attention Loss (ATAL) to improve semantic alignment and detail in masked regions. It adds a dual-branch architecture and integrates with a frozen base model via zero convolutions, enabling dense pixel-wise control while preserving portability. The authors also propose PainterData, a diversified training dataset with localized prompts and varied mask types, and PainterBench, a realism-focused benchmark with localized prompts and masks to assess real-world applicability. Experimental results show PainterNet surpassing state-of-the-art methods on key metrics for local and global text consistency and detail preservation, with strong generalization to multiple downstream styles.
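The zero-convolution coupling mentioned above can be illustrated with a small, dependency-free sketch. It is not PainterNet's implementation: the class `ZeroConv1x1`, the helper `inject`, and the toy feature values are all hypothetical names chosen for illustration. The sketch only shows the general idea behind zero convolutions: a 1x1 convolution initialized to zero contributes nothing at the start of training, so the frozen base model's features pass through unchanged until the added branch learns something.

```python
from typing import List

Vector = List[float]  # feature vector (one value per channel) at a single pixel


class ZeroConv1x1:
    """Hypothetical 1x1 convolution whose weights and bias start at zero."""

    def __init__(self, in_channels: int, out_channels: int) -> None:
        self.weight = [[0.0] * in_channels for _ in range(out_channels)]
        self.bias = [0.0] * out_channels

    def __call__(self, pixels: List[Vector]) -> List[Vector]:
        # A 1x1 convolution applies the same linear map at every pixel.
        return [
            [sum(w * x for w, x in zip(row, px)) + b
             for row, b in zip(self.weight, self.bias)]
            for px in pixels
        ]


def inject(base_features: List[Vector],
           branch_features: List[Vector],
           zero_conv: ZeroConv1x1) -> List[Vector]:
    """Add the projected branch features to the frozen base features."""
    projected = zero_conv(branch_features)
    return [[b + p for b, p in zip(base_px, proj_px)]
            for base_px, proj_px in zip(base_features, projected)]


# Toy data (assumed): 2 pixels, 2 base channels, 3 branch channels.
base = [[1.0, 2.0], [3.0, 4.0]]
branch = [[0.5, -0.5, 1.0], [2.0, 0.0, -1.0]]
conv = ZeroConv1x1(in_channels=3, out_channels=2)

# At initialization the branch has no effect on the base features.
untrained = inject(base, branch, conv)

# Stand-in for a training update: one weight becomes non-zero.
conv.weight[0][0] = 1.0
trained = inject(base, branch, conv)
```

Because the base features are untouched at initialization, a branch attached this way can be added to, or removed from, a pretrained model without changing its starting behavior, which is what makes the plug-and-play claim plausible.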
Abstract
Diffusion models have recently shown strong performance in image inpainting, and diffusion-based inpainting methods can usually generate realistic, high-quality content for masked areas. However, owing to limitations of diffusion models, existing methods typically struggle with semantic consistency between the generated image and the text prompt, and they do not match the way users actually edit images. To address these issues, we present PainterNet, a plugin that can be flexibly embedded into various diffusion models. To generate content in the masked areas that closely aligns with the user's prompt, we propose local prompt input, Attention Control Points (ACP), and Actual-Token Attention Loss (ATAL), which strengthen the model's focus on local areas. Additionally, we redesign the mask generation algorithm for the training and testing datasets to simulate how users apply masks, and we introduce a new customized training dataset, PainterData, and a benchmark dataset, PainterBench. Extensive experiments show that PainterNet surpasses existing state-of-the-art models on key metrics, including image quality and global and local text consistency.
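The abstract does not specify how the mask generation algorithm works, so the following is only a rough sketch of what sampling varied mask types could look like. Everything in it is an assumption made for illustration: the two mask types (a rectangle and a free-form brush stroke), the function names `box_mask`, `brush_mask`, and `sample_mask`, the 50/50 sampling split, and the stroke parameters. It should not be read as the PainterData procedure.

```python
import random
from typing import List

Mask = List[List[int]]  # 1 = pixel to inpaint, 0 = pixel to keep


def box_mask(height: int, width: int,
             top: int, left: int, bottom: int, right: int) -> Mask:
    """Rectangular mask covering rows [top, bottom) and columns [left, right)."""
    return [[1 if top <= y < bottom and left <= x < right else 0
             for x in range(width)] for y in range(height)]


def brush_mask(height: int, width: int, steps: int, radius: int,
               rng: random.Random) -> Mask:
    """Free-form mask: stamp a disc along a short random walk."""
    mask = [[0] * width for _ in range(height)]
    y, x = rng.randrange(height), rng.randrange(width)
    for _ in range(steps):
        for yy in range(max(0, y - radius), min(height, y + radius + 1)):
            for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                if (yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2:
                    mask[yy][xx] = 1
        # Move the brush by a small random offset, clamped to the image.
        y = min(height - 1, max(0, y + rng.randint(-radius, radius)))
        x = min(width - 1, max(0, x + rng.randint(-radius, radius)))
    return mask


def sample_mask(height: int, width: int, rng: random.Random) -> Mask:
    """Pick one mask type at random (the 50/50 split is an arbitrary assumption)."""
    if rng.random() < 0.5:
        top, left = rng.randrange(height - 1), rng.randrange(width - 1)
        bottom, right = rng.randint(top + 1, height), rng.randint(left + 1, width)
        return box_mask(height, width, top, left, bottom, right)
    return brush_mask(height, width, steps=8, radius=2, rng=rng)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
mask = sample_mask(16, 16, rng)
```

Mixing mask shapes during training is one plausible way to expose a model to both coarse box selections and hand-drawn strokes, which is the kind of user behavior the abstract says the redesigned algorithm aims to simulate.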
