I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting

Nicola Fanelli, Gennaro Vessio, Giovanna Castellano

TL;DR

This work introduces text-guided multi-mask inpainting, enabling simultaneous filling of multiple image regions each governed by its own prompt. It combines automatic object grounding and object-level captioning with a fine-tuned multimodal language model to generate per-mask prompts, then applies rectified cross-attention within a diffusion model to enforce region-specific prompts in a single inpainting pass. The pipeline demonstrates plausible, creative results on digitized art from WikiArt and photographic scenes in the DCI dataset, and shows that multi-mask prompt generation benefits from multi-task training of the LLM and region-aligned diffusion. Practically, this enables automated data augmentation and advanced image editing capabilities while highlighting the need to consider potential misuse and domain transfer effects.

Abstract

Inpainting focuses on filling missing or corrupted regions of an image to blend seamlessly with its surrounding content and style. While conditional diffusion models have proven effective for text-guided inpainting, we introduce the novel task of multi-mask inpainting, where multiple regions are simultaneously inpainted using distinct prompts. Furthermore, we design a fine-tuning procedure for multimodal LLMs, such as LLaVA, to generate multi-mask prompts automatically using corrupted images as inputs. These models can generate helpful and detailed prompt suggestions for filling the masked regions. The generated prompts are then fed to Stable Diffusion, which is fine-tuned for the multi-mask inpainting problem using rectified cross-attention, enforcing prompts onto their designated regions for filling. Experiments on digitized paintings from WikiArt and the Densely Captioned Images dataset demonstrate that our pipeline delivers creative and accurate inpainting results. Our code, data, and trained models are available at https://cilabuniba.github.io/i-dream-my-painting.
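The region-specific prompting described above can be illustrated with a small sketch of masked cross-attention. This is not the authors' implementation: the function name, the region-id encoding, and the choice to let pixels outside every mask attend to all text tokens are assumptions made for illustration. It shows only the core idea that each image location in a masked region attends solely to the tokens of that region's prompt.

```python
import math

def rectified_cross_attention(queries, keys, values, pixel_region, token_region):
    """Masked cross-attention sketch (hypothetical helper, pure Python).

    queries:      one vector per image location
    keys, values: one vector per text token (all prompts concatenated)
    pixel_region: region id per image location, -1 if outside every mask
    token_region: region id (prompt index) per text token
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q, region in zip(queries, pixel_region):
        logits = []
        for k, tok_region in zip(keys, token_region):
            # Assumption: unmasked locations (-1) may attend to every token.
            allowed = region == -1 or region == tok_region
            score = sum(a * b for a, b in zip(q, k)) * scale
            logits.append(score if allowed else -math.inf)
        top = max(logits)
        if top == -math.inf:
            raise ValueError("a masked region has no prompt tokens")
        # Softmax over the allowed tokens; disallowed tokens get weight 0.
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: three image locations, two prompts of two tokens each.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
pixel_region = [0, 1, -1]    # location 2 lies outside every mask
token_region = [0, 0, 1, 1]  # tokens 0-1 belong to prompt 0, tokens 2-3 to prompt 1
outputs, weights = rectified_cross_attention(
    queries, keys, values, pixel_region, token_region
)
```

In this toy setup, the first location places zero attention weight on the tokens of prompt 1, and the second places zero weight on the tokens of prompt 0, which is the behavior that keeps one prompt's concepts from leaking into another region.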

Paper Structure

This paper contains 34 sections, 3 equations, 17 figures, and 6 tables.

Figures (17)

  • Figure 1: We present a pipeline to address a novel task we refer to as text-guided multi-mask inpainting. In this task, given an image with multiple masked regions, we aim to inpaint all regions simultaneously, with each region guided by its own text prompt. Moreover, we demonstrate that it is possible to generate creative and plausible multi-mask text prompts automatically, starting solely from the masked image, thereby fully automating the inpainting process.
  • Figure 2: Overview of our automatic annotation process. We input artwork images into Kosmos-2 to obtain bounding box annotations for the main objects in each image. Afterward, we provide cropped images of these objects to LLaVA to generate more detailed object-level captions for our dataset.
  • Figure 3: Overview of our pipeline. We utilize LLaVA as a prompt generator for multi-mask inpainting, integrating the generated multi-mask prompts and inpainting layout with Stable Diffusion through rectified cross-attention. The only trainable components of the pipeline are a QLoRA adapter for LLaVA, with the model quantized to 4 bits, along with a LoRA adapter that modifies the cross-attention layers of Stable Diffusion to accommodate the new task.
  • Figure 4: Effects of LLaVA sampling temperature on the quality and diversity of prompt generation. For CLIPSim, we report the similarity value scaled by 2.5 to map it into the $[0, 1]$ interval, allowing for easier comparison, as done in CLIPScore (Hessel et al., 2021).
  • Figure 5: Qualitative results. We compare the different approaches tested in our evaluation. Fine-tuning with RCA enables the model to better follow prompts, capture details, and respect inpainting locations while reducing susceptibility to concept blending. The second example is from DCI.
  • ...and 12 more figures
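
The CLIPSim rescaling mentioned in the Figure 4 caption can be sketched as follows. The caption states only that the similarity is multiplied by 2.5; clamping negative similarities to zero follows the CLIPScore definition and is an assumption here, as are the function names. The embeddings are plain lists standing in for CLIP image and text features.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def scaled_clip_similarity(cos_sim, weight=2.5):
    """Rescale a cosine similarity as in CLIPScore: weight * max(cos, 0).

    The clamp at zero is assumed from the CLIPScore definition; the
    figure caption itself only mentions the factor of 2.5.
    """
    return weight * max(cos_sim, 0.0)

# Hypothetical stand-ins for a CLIP image embedding and a text embedding.
image_embedding = [0.6, 0.8, 0.0]
text_embedding = [0.0, 0.6, 0.8]
score = scaled_clip_similarity(cosine_similarity(image_embedding, text_embedding))
```

The factor works as a mapping into $[0, 1]$ only because raw CLIP cosine similarities between images and text rarely exceed about 0.4; the sketch does not clip values above 1.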