I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting

Nicola Fanelli; Gennaro Vessio; Giovanna Castellano

TL;DR

This work introduces text-guided multi-mask inpainting, in which multiple image regions are filled simultaneously, each governed by its own prompt. It combines automatic object grounding and object-level captioning with a fine-tuned multimodal LLM to generate per-mask prompts, then applies rectified cross-attention within a diffusion model to enforce each prompt on its own region in a single inpainting pass. The pipeline produces plausible, creative results on digitized art from WikiArt and on photographic scenes from the DCI dataset, and shows that multi-mask prompt generation benefits from multi-task training of the multimodal LLM and from region-aligned diffusion. In practice, this enables automated data augmentation and advanced image editing, while highlighting the need to consider potential misuse and domain transfer effects.

Abstract

Inpainting focuses on filling missing or corrupted regions of an image to blend seamlessly with its surrounding content and style. While conditional diffusion models have proven effective for text-guided inpainting, we introduce the novel task of multi-mask inpainting, where multiple regions are simultaneously inpainted using distinct prompts. Furthermore, we design a fine-tuning procedure for multimodal LLMs, such as LLaVA, to generate multi-mask prompts automatically using corrupted images as inputs. These models can generate helpful and detailed prompt suggestions for filling the masked regions. The generated prompts are then fed to Stable Diffusion, which is fine-tuned for the multi-mask inpainting problem using rectified cross-attention, enforcing prompts onto their designated regions for filling. Experiments on digitized paintings from WikiArt and the Densely Captioned Images dataset demonstrate that our pipeline delivers creative and accurate inpainting results. Our code, data, and trained models are available at https://cilabuniba.github.io/i-dream-my-painting.
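
To make the idea of enforcing prompts onto their designated regions more concrete, the following is a minimal sketch of region-restricted cross-attention in plain Python. It assumes that the rectification amounts to masking attention logits so that a pixel inside mask k attends only to the tokens of prompt k, and that pixels outside every mask are left unrestricted; the paper's exact formulation may differ. The function name `rectified_cross_attention` and the arguments `pixel_region` and `token_region` are hypothetical and are not taken from the authors' code.

```python
import math


def rectified_cross_attention(queries, keys, values, pixel_region, token_region):
    """Cross-attention in which each pixel only sees the tokens of its own prompt.

    queries:      one query vector per pixel (list of lists of floats)
    keys, values: one key / value vector per prompt token
    pixel_region: region id per pixel, or None for pixels outside every mask
    token_region: region id of the prompt each token belongs to
    All names are illustrative assumptions, not the authors' API.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q, region in zip(queries, pixel_region):
        # Assumption: unmasked pixels (region None) may attend to every token.
        allowed = [region is None or region == t for t in token_region]
        if not any(allowed):
            raise ValueError(f"region {region!r} has no prompt tokens")
        # Disallowed pixel-token pairs get a logit of -inf, i.e. zero weight.
        logits = [
            scale * sum(a * b for a, b in zip(q, k)) if ok else -math.inf
            for k, ok in zip(keys, allowed)
        ]
        top = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors, one output dimension at a time.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy example: two masked pixels (regions 0 and 1) and one unmasked pixel,
# with two tokens for each of the two prompts.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
outputs, weights = rectified_cross_attention(
    queries, keys, values, pixel_region=[0, 1, None], token_region=[0, 0, 1, 1]
)
```

In this toy setup the first pixel places zero attention weight on the tokens of the second prompt and vice versa, which is the sense in which each prompt is confined to its own region. In a real diffusion model the same masking would be applied per attention head inside the denoising network's cross-attention layers, with the region masks resized to each layer's spatial resolution.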
