Salient Object-Aware Background Generation using Text-Guided Diffusion Models

Amir Erfan Eshratifar; Joao V. B. Soares; Kapil Thadani; Shaunak Mishra; Mikhail Kuznetsov; Yueh-Ning Ku; Paloma de Juan


TL;DR

The paper tackles salient object outpainting and identifies object expansion as a key failure mode when standard inpainting diffusion models are used to generate backgrounds. It proposes a ControlNet-augmented extension of Stable Inpainting 2.0 (SI2) that conditions on both the salient object mask and the masked image to constrain object boundaries. It also introduces an automated object-expansion metric based on SAM-derived masks, $E = \mathrm{area}(m_{o} \cup m_{i}) - \mathrm{area}(m_{i})$, where $m_{i}$ and $m_{o}$ are the salient object masks of the original and outpainted images. Across multiple datasets, the approach reduces object expansion by an average of 3.6× compared to SI2 while maintaining standard visual metrics, and training with COCO data improves background diversity. The work has practical implications for e-commerce and design because it enables more faithful background generation around salient subjects. It also discusses future directions, including backgrounds for non-salient objects and alternative control architectures.
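The object-expansion metric above can be illustrated on toy binary masks. This is a minimal sketch, not the authors' code: the function name `object_expansion` is hypothetical, and it assumes area is measured as a raw pixel count on boolean masks of equal shape (the source does not say whether the value is normalized).

```python
import numpy as np

def object_expansion(m_i: np.ndarray, m_o: np.ndarray) -> int:
    """E = area(m_o ∪ m_i) - area(m_i), with area assumed to be a pixel count."""
    m_i = m_i.astype(bool)
    m_o = m_o.astype(bool)
    if m_i.shape != m_o.shape:
        raise ValueError("masks must have the same shape")
    union_area = int(np.logical_or(m_o, m_i).sum())
    return union_area - int(m_i.sum())

# Toy example: a 4x4 object that grows by one extra row after outpainting.
m_i = np.zeros((8, 8), dtype=bool)
m_i[2:6, 2:6] = True   # original object: 16 pixels
m_o = m_i.copy()
m_o[6, 2:6] = True     # outpainted object: 4 additional pixels

expansion = object_expansion(m_i, m_o)
```

Because the union is taken with $m_i$, the value cannot be negative: pixels the outpainted mask fails to cover do not offset pixels it adds.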

Abstract

Generating background scenes for salient objects plays a crucial role across various domains including creative design and e-commerce, as it enhances the presentation and context of subjects by integrating them into tailored environments. Background generation can be framed as a task of text-conditioned outpainting, where the goal is to extend image content beyond a salient object's boundaries on a blank background. Although popular diffusion models for text-guided inpainting can also be used for outpainting by mask inversion, they are trained to fill in missing parts of an image rather than to place an object into a scene. Consequently, when used for background creation, inpainting models frequently extend the salient object's boundaries and thereby change the object's identity, which is a phenomenon we call "object expansion." This paper introduces a model for adapting inpainting diffusion models to the salient object outpainting task using Stable Diffusion and ControlNet architectures. We present a series of qualitative and quantitative results across models and datasets, including a newly proposed metric to measure object expansion that does not require any human labeling. Compared to Stable Diffusion 2.0 Inpainting, our proposed approach reduces object expansion by 3.6x on average with no degradation in standard visual metrics across multiple datasets.
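The "mask inversion" mentioned in the abstract can be sketched as follows. This is an illustrative assumption about how the inputs might be prepared, not the paper's pipeline: the helper name `outpainting_inputs` is hypothetical, the fill mask is taken to be True where the model should generate, and background pixels are blanked to zero.

```python
import numpy as np

def outpainting_inputs(image: np.ndarray, object_mask: np.ndarray):
    """Build inpainting-style inputs for outpainting around a salient object.

    image: H x W x 3 float array in [0, 1]
    object_mask: H x W boolean array, True on the salient object
    """
    object_mask = object_mask.astype(bool)
    # Inpainting normally fills a hole inside the image; inverting the object
    # mask asks the model to fill everything except the object instead.
    fill_mask = ~object_mask
    # Keep only the object pixels; the background is blanked (assumed zero).
    masked_image = image * object_mask[..., None]
    return fill_mask, masked_image

rng = np.random.default_rng(0)
image = rng.uniform(size=(8, 8, 3))
object_mask = np.zeros((8, 8), dtype=bool)
object_mask[2:6, 2:6] = True

fill_mask, masked_image = outpainting_inputs(image, object_mask)
```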


Paper Structure (13 sections, 4 equations, 6 figures, 3 tables)


Introduction
Related Work
Diffusion Models
Text-guided Image Inpainting
Salient Object Outpainting
Stable Inpainting
ControlNet for Stable Inpainting
Measuring Object Expansion
Experiments
Experimental Procedure
Results
Ablation Studies
Conclusions and Future Work

Figures (6)

Figure 1: Examples of outpainting a salient object (leftmost column) using the Stable Inpainting 2.0 (SI2) model (columns 2, 4, 6 from left) and using our proposed model (columns 3, 5, 7 from left). The images in each paired column (2 & 3, 4 & 5, 6 & 7) are generated using the same seed and prompt, but one uses SI2, and the other uses our model. Objects are often expanded using the SI2 model, which may catastrophically change the object's identity. For example, the legs of the tables are expanded in the first two rows; in the third row, a bench is transformed into a bed; in the last row, a swan is blended into a rock and a bed.
Figure 2: Significant object expansion is seen at the bottom of the white dresser with RunwayML's Background Remix, a popular commercial tool. These examples are generated with the prompt of "a modern room."
Figure 3: The proposed architecture for salient object outpainting. The original ControlNet architecture only works with text-to-image Stable Diffusion. To make it compatible with text-to-image Stable Inpainting we modified the ControlNet's U-Net architecture to take two extra inputs: 1) mask and 2) masked image. The blue region denotes the frozen Stable Inpainting's U-Net model. The red region includes replicas of the encoder layers of the blue region. The zero convolution outputs from the red region modulate the outputs of decoder layers in the blue region. Initially, during training, the modulation has no effect on the output as the weights of the convolution layer are initialized to zero. Gradually, during training, the nuances of the task of background generation for salient objects will be encoded in modulated values.
Figure 4: Pipeline for computing salient object masks of the original image ($m_i$) and the outpainted image ($m_o$) to measure object expansion. We found that existing salient object segmentation (SOS) models underperform on synthetic images, but the Segment Anything Model (SAM) works robustly. Therefore, we (i) obtain the salient object mask of the original image using the SOS model, (ii) sample random points from the mask, and (iii) pass sampled point coordinates as the input point prompt to SAM to obtain the salient object mask $m_o$. We obtain a new salient mask from SAM for the original image ($m_i$) as well for an apples-to-apples comparison with $m_o$.
Figure 5: Controlling the strength of ControlNet using the adjustable weight $w$ at inference time. With $w=0.0$, objects can expand freely. Setting $w=1.0$ aggressively prevents expansion.
...and 1 more figure
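The zero-convolution behavior described in the Figure 3 caption and the inference-time weight $w$ from Figure 5 can be sketched with plain NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the modulation is assumed to be additive, the zero convolution is modeled as a 1x1 convolution over channels, and all names (`zero_conv`, `modulate`, the random feature maps) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8

decoder_feat = rng.normal(size=(C, H, W))  # stand-in for frozen decoder features (blue region)
control_feat = rng.normal(size=(C, H, W))  # stand-in for trainable encoder-copy features (red region)

def zero_conv(x, weight, bias):
    """1x1 convolution over channels: weight is (C_out, C_in), bias is (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def modulate(decoder_feat, control_feat, weight, bias, w=1.0):
    """Add the control signal, scaled by w, to the frozen decoder features."""
    return decoder_feat + w * zero_conv(control_feat, weight, bias)

# At the start of training the convolution weights are zero, so the control
# branch has no effect on the frozen model's output.
weight0, bias0 = np.zeros((C, C)), np.zeros(C)
at_init = modulate(decoder_feat, control_feat, weight0, bias0, w=1.0)

# After training the weights are non-zero (random values stand in here).
weight_trained, bias_trained = rng.normal(scale=0.1, size=(C, C)), np.zeros(C)
control_off = modulate(decoder_feat, control_feat, weight_trained, bias_trained, w=0.0)
control_on = modulate(decoder_feat, control_feat, weight_trained, bias_trained, w=1.0)
```

In this sketch, `w=0.0` recovers the frozen model's features exactly, which matches the caption's observation that objects can expand freely at that setting, while larger `w` lets the trained control branch steer the output.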
