A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting

Junhao Zhuang; Yanhong Zeng; Wenran Liu; Chun Yuan; Kai Chen

TL;DR

PowerPaint is a diffusion-based inpainting framework that unifies context-aware filling and text-guided object synthesis through learnable task prompts. It trains two task prompts, P_obj for text-guided object inpainting and P_ctxt for context-aware filling, plus a third prompt, P_shape, for shape-guided inpainting. Combined with prompt interpolation and classifier-free guidance, these prompts let a single model achieve state-of-the-art results on object inpainting, object removal, and shape-constrained editing. Experiments on OpenImages, MSCOCO, Places2, and Flickr-Scenery support these results, and ablations confirm the value of task-specific prompts and unified training. Practical applications include controllable shape fitting and compatibility with ControlNet. Code and models are publicly released.

Abstract

Advancing image inpainting is challenging as it requires filling user-specified regions for various intents, such as background filling and object synthesis. Existing approaches focus on either context-aware filling or object synthesis using text descriptions. However, achieving both tasks simultaneously is challenging due to differing training strategies. To overcome this challenge, we introduce PowerPaint, the first high-quality and versatile inpainting model that excels in multiple inpainting tasks. First, we introduce learnable task prompts along with tailored fine-tuning strategies to guide the model's focus on different inpainting targets explicitly. This enables PowerPaint to accomplish various inpainting tasks by utilizing different task prompts, resulting in state-of-the-art performance. Second, we demonstrate the versatility of the task prompt in PowerPaint by showcasing its effectiveness as a negative prompt for object removal. Moreover, we leverage prompt interpolation techniques to enable controllable shape-guided object inpainting, enhancing the model's applicability in shape-guided applications. Finally, we conduct extensive experiments and applications to verify the effectiveness of PowerPaint. We release our codes and models on our project page: https://powerpaint.github.io/.
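The object-removal idea described above, where a learned task prompt is reused as a negative prompt, can be illustrated with the standard classifier-free guidance combination. This is a minimal sketch, not the authors' implementation: the function name `cfg_combine`, the toy noise vectors, and the guidance scale are hypothetical, and the formula is the generic guidance rule rather than one confirmed by this page.

```python
from typing import List


def cfg_combine(eps_positive: List[float],
                eps_negative: List[float],
                guidance_scale: float) -> List[float]:
    """Generic classifier-free guidance step (assumed form).

    Starts from the noise predicted under the negative prompt and moves
    toward the noise predicted under the positive prompt:
        eps = eps_negative + w * (eps_positive - eps_negative)
    """
    if len(eps_positive) != len(eps_negative):
        raise ValueError("noise predictions must have the same length")
    return [n + guidance_scale * (p - n)
            for p, n in zip(eps_positive, eps_negative)]


# Toy stand-ins for the denoiser output under each task prompt.
# For object removal, the context prompt plays the positive role and
# the object prompt plays the negative role (hypothetical values).
eps_ctxt = [0.2, -0.1, 0.4]   # conditioned on P_ctxt
eps_obj = [0.5, 0.3, 0.4]     # conditioned on P_obj

guided = cfg_combine(eps_ctxt, eps_obj, guidance_scale=7.5)
```

With a scale of 1 the result equals the positive-prompt prediction; larger scales push the sample further away from whatever the negative prompt describes, which is why an object-synthesis prompt in the negative slot discourages new objects in the masked region.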

Paper Structure (21 sections, 7 equations, 20 figures, 7 tables)

Introduction
Related Work
PowerPaint
Preliminary
Learning with Task Prompts
Implementation Details
Experiments
Comparisons with State-of-the-Art
Ablation Study
Applications and Limitations
Conclusions
Appendix
Qualitative Comparisons
Text-guided object inpainting.
Shape-guided object inpainting.
...and 6 more sections

Figures (20)

Figure 1: PowerPaint is the first versatile image inpainting model that simultaneously achieves state-of-the-art results in various inpainting tasks, including text-guided object inpainting, object removal, shape-guided object inpainting with controllable shape-fitting, outpainting, etc. [Best viewed in color with zoom-in]
Figure 2: Overview. PowerPaint fine-tunes a text-to-image model with two learnable task prompts, i.e., $\mathbf{P_{obj}}$ and $\mathbf{P_{ctxt}}$, for text-guided object inpainting and context-aware image inpainting, respectively. After training, $\mathbf{P_{obj}}$ can be further used as a negative prompt with classifier-free guidance sampling for effective object removal. We further introduce $\mathbf{P_{shape}}$ for shape-guided object inpainting, which can be extended by prompt interpolation with $\mathbf{P_{ctxt}}$ to control the degree of shape-fitting for object inpainting.
Figure 3: To remove objects from crowded image context, the commercial product Adobe Firefly tends to copy from the context (as circled in the green bounding box), while PowerPaint successfully erases the objects.
Figure 4: Illustration of prompt interpolation. To enable object inpainting with a controllable shape-fitting degree, we randomly expand the object segmentation mask and interpolate $\mathbf{P_{ctxt}}$ and $\mathbf{P_{shape}}$ according to the expanded area ratio.
Figure 5: Compared with SOTA approaches, PowerPaint shows better text alignment and visual quality for text-guided object inpainting.
...and 15 more figures
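The prompt interpolation described in the Figure 4 caption can be sketched as a linear blend of two prompt embeddings weighted by a mask area ratio. This is an illustrative sketch under stated assumptions: the helper names `area_ratio` and `interpolate_prompts` are hypothetical, the ratio is assumed to be object area divided by expanded-mask area, and the blend direction (a tight mask favours P_shape, a loose mask favours P_ctxt) is an assumption not confirmed by the caption.

```python
from typing import List


def area_ratio(object_mask: List[List[int]],
               expanded_mask: List[List[int]]) -> float:
    """Fraction of the expanded mask covered by the original object mask."""
    object_area = sum(sum(row) for row in object_mask)
    expanded_area = sum(sum(row) for row in expanded_mask)
    if expanded_area == 0:
        raise ValueError("expanded mask is empty")
    return object_area / expanded_area


def interpolate_prompts(p_ctxt: List[float],
                        p_shape: List[float],
                        alpha: float) -> List[float]:
    """Linear blend: alpha = 1 gives P_shape, alpha = 0 gives P_ctxt (assumed)."""
    if len(p_ctxt) != len(p_shape):
        raise ValueError("prompt embeddings must have the same length")
    return [(1.0 - alpha) * c + alpha * s for c, s in zip(p_ctxt, p_shape)]


# Binary masks: a 2x2 object inside a 4x4 image, expanded to a 2x4 band.
object_mask = [[0, 0, 0, 0],
               [0, 1, 1, 0],
               [0, 1, 1, 0],
               [0, 0, 0, 0]]
expanded_mask = [[0, 0, 0, 0],
                 [1, 1, 1, 1],
                 [1, 1, 1, 1],
                 [0, 0, 0, 0]]

alpha = area_ratio(object_mask, expanded_mask)            # 4 / 8
blended = interpolate_prompts([0.0, 2.0], [1.0, 0.0], alpha)
```

In a real model the prompt embeddings would be high-dimensional token embeddings rather than two-element lists; the toy vectors only show how the area ratio sets the blend weight.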
