PainterNet: Adaptive Image Inpainting with Actual-Token Attention and Diverse Mask Control

Ruichen Wang, Junliang Zhang, Qingsong Xie, Chen Chen, Haonan Lu

TL;DR

PainterNet introduces a plug-and-play inpainting framework for diffusion models that leverages local textual prompts, Attention Control Points (ACP), and Actual-Token Attention Loss (ATAL) to improve semantic alignment and detail in masked regions. It adds a dual-branch architecture and integrates with a frozen base model via zero convolutions, enabling dense pixel-wise control while preserving portability. The authors also propose PainterData, a diversified training dataset with localized prompts and varied mask types, and PainterBench, a realism-focused benchmark with localized prompts and masks to assess real-world applicability. Experimental results show PainterNet surpassing state-of-the-art methods on key metrics for local and global text consistency and detail preservation, with strong generalization to multiple downstream styles.
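The zero-convolution coupling mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `zero_conv1x1`, the tensor shapes, and the additive fusion are assumptions chosen for illustration. The point is that a convolution whose weights and bias start at zero contributes nothing at initialization, so the frozen base model's features pass through unchanged until training updates the added branch.

```python
import numpy as np

def zero_conv1x1(features, weight, bias):
    """Apply a 1x1 convolution to a (C_in, H, W) feature map, giving (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, features) + bias[:, None, None]

rng = np.random.default_rng(0)
base_features = rng.normal(size=(4, 8, 8))    # stand-in for a frozen base-model block output
branch_features = rng.normal(size=(4, 8, 8))  # stand-in for the added branch's output

# Zero initialization: the connecting layer starts with all-zero parameters.
weight = np.zeros((4, 4))
bias = np.zeros(4)

# Assumed additive fusion: at initialization the branch has no effect on the base features.
fused = base_features + zero_conv1x1(branch_features, weight, bias)
```

Once `weight` moves away from zero during training, the branch signal is mixed into the base features gradually, which is why this kind of coupling leaves the base model's initial behavior intact.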

Abstract

Diffusion models have recently shown strong performance in image inpainting, and diffusion-based inpainting methods can usually generate realistic, high-quality content for masked areas. However, owing to limitations of diffusion models, existing methods typically struggle with semantic consistency between the image and the text prompt, and they do not accommodate users' editing habits well. To address these issues, we present PainterNet, a plugin that can be flexibly embedded into various diffusion models. To generate content in the masked areas that closely aligns with the user's prompt, we propose local prompt input, Attention Control Points (ACP), and an Actual-Token Attention Loss (ATAL) to strengthen the model's focus on local areas. Additionally, we redesign the mask generation algorithm used for the training and test datasets to simulate how users apply masks, and we introduce a new customized training dataset, PainterData, and a benchmark dataset, PainterBench. Extensive experiments show that PainterNet surpasses existing state-of-the-art models on key metrics, including image quality and global/local text consistency.

Paper Structure

This paper contains 21 sections, 6 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison of previous inpainting architectures and PainterNet. (a) A dedicated inpainting model based on a diffusion model, adapted through extended input channels and fine-tuning. (b) Our plug-and-play approach, PainterNet, which introduces an additional guidance branch for hierarchical dense control via layer and attention control points.
  • Figure 2: Comparison of generation and datasets under local and global textual prompts. (a) Our method generates correct results from the local textual prompt, whereas other methods cannot ensure consistent generation when given a global textual prompt. (b) In contrast to BrushData, our PainterData contains multiple types of masks (e.g., bounding box, irregular, segmentation-based) as well as local textual prompts generated by multimodal large language models (MLLMs).
  • Figure 3: Overview of our method. PainterNet introduces an additional branch that hierarchically incorporates the complete UNet features into the pre-trained UNet, layer by layer, through layer and attention control points. We also design the Actual-Token Attention Loss (ATAL) $\mathcal{L}_{\text{ATAL}}$ to direct the model's attention to the mask region. The masking strategy generates diverse masks (e.g., bounding box $m_{box}$, irregular $m_{irr}$, and segmentation-based $m_{seg}$) and selects the input mask shape based on a random number $k \in [0, 1]$. $A_{i}$, $i \in \{1, 2, \dots, N\}$, represents the cross-attention map of the $i$-th layer, where $N$ is the total number of layers. $m_{i}$ denotes the mask $m$ resized to fit $A_{i}$. $\mathcal{L}_{\text{diff}}$ denotes the diffusion loss.
  • Figure 4: Comparison of PainterNet and previous inpainting methods on tasks in various image styles: I and II are natural images, III and IV are cartoons, and V is an illustration.
  • Figure 5: Generation results when PainterNet is transferred to other downstream models. Model I generates anime-style outputs (counterfeit-v3-huggingface), Model II produces Van Gogh-style art (huggingface_vangogh), and Model III generates specific characters such as Iron Man (civitai_ironman).
  • ...and 3 more figures
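
The masking strategy described in the Figure 3 caption, where a random number $k \in [0, 1]$ selects among a bounding-box, an irregular, and a segmentation-based mask, might be sketched as follows. This is a hedged illustration, not the paper's algorithm: the thresholds `t_box` and `t_irr`, the helper names, and the blob-based stand-in for the irregular mask are all assumptions, since the captions do not specify them.

```python
import numpy as np

def box_mask(seg):
    """Bounding-box mask: fill the tightest rectangle around the segmentation."""
    ys, xs = np.nonzero(seg)
    mask = np.zeros_like(seg)
    mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1
    return mask

def irregular_mask(seg, rng, n_blobs=3):
    """Hypothetical irregular mask: the segmentation plus a few random square blobs."""
    ys, xs = np.nonzero(seg)
    mask = seg.copy()
    for _ in range(n_blobs):
        i = rng.integers(len(ys))      # pick a random foreground pixel as blob centre
        r = rng.integers(1, 3)         # blob radius of 1 or 2 pixels
        y, x = ys[i], xs[i]
        mask[max(y - r, 0):y + r + 1, max(x - r, 0):x + r + 1] = 1
    return mask

def select_mask(seg, k, rng, t_box=1 / 3, t_irr=2 / 3):
    """Choose the mask shape from k in [0, 1]; the thresholds are assumed values."""
    if k < t_box:
        return box_mask(seg)
    if k < t_irr:
        return irregular_mask(seg, rng)
    return seg.copy()                  # segmentation-based mask

rng = np.random.default_rng(0)
seg = np.zeros((8, 8), dtype=np.uint8)
seg[2:5, 3] = 1                        # L-shaped toy object
seg[4, 3:6] = 1
mask = select_mask(seg, rng.random(), rng)
```

Drawing `k` per training sample exposes the model to all three mask shapes, which is one plausible way to approximate the variety of masks users draw in practice.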