Shadow Generation with Decomposed Mask Prediction and Attentive Shadow Filling

Xinhao Tao, Junyan Cao, Yan Hong, Li Niu

TL;DR

This work tackles the realism gap in image composition by generating plausible foreground shadows for inserted objects. It introduces a large-scale rendered RdSOBA dataset to augment limited real-data sources and a two-stage DMASNet architecture that first predicts a decomposed shadow mask (box and shape) and then fills the shadow with attention to background shadow pixels. The approach demonstrates superior visual realism and strong cross-domain transfer to real composite images, outperforming baselines on multiple metrics and in human studies. The combined dataset and method offer practical improvements for realistic image editing and synthesis in applications requiring coherent shadows across diverse scenes.

Abstract

Image composition refers to inserting a foreground object into a background image to obtain a composite image. In this work, we focus on generating plausible shadows for the inserted foreground object to make the composite image more realistic. To supplement the existing small-scale dataset, we create a large-scale dataset called RdSOBA with rendering techniques. Moreover, we design a two-stage network named DMASNet with decomposed mask prediction and attentive shadow filling. Specifically, in the first stage, we decompose shadow mask prediction into box prediction and shape prediction. In the second stage, we attend to reference background shadow pixels to fill the foreground shadow. Extensive experiments show that our DMASNet achieves better visual effects and generalizes well to real composite images.

Paper Structure (18 sections, 7 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Example data for the shadow generation task. The left two examples are from our RdSOBA dataset and the right two are from the DESOBA dataset [sgrnet]. In each example, the background object (resp., shadow) mask is outlined in green (resp., blue) and the foreground object (resp., shadow) mask is outlined in red (resp., yellow).
  • Figure 2: The architecture of our proposed DMASNet. In the first stage, we employ $E_c$ to extract $F_e$, based on which the box head $H_b$ and the shape head $H_s$ jointly predict the rough mask $\hat{M}_{fs}$. Using the decoder feature from $D_t$, we refine $\hat{M}_{fs}$ to get $\hat{M}_{fs}^{'}$. In the second stage, we employ $E_s$ to extract $F_s$, based on which we calculate the attention map $A$ within the background shadow region to get the target mean value for the foreground shadow pixels. To match the target mean value, we scale $I_c$ to get $I_{dark}$. Finally, we use $\hat{M}_{fs}^{'}$ to combine $I_{dark}$ with $I_c$ to get the final result $\hat{I}_{g}$.
  • Figure 3: Example results of different methods in the setting of DESOBA $\rightarrow$ DESOBA.
  • Figure 4: Example results of a comprehensive comparison between our DMASNet and SGRNet.
  • Figure 5: Example results of different methods on real composite images.
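
The second stage described in the Figure 2 caption (attend to background shadow pixels, derive a target mean, scale $I_c$ into $I_{dark}$, then blend with the refined mask) can be illustrated with a small sketch. This is not the authors' implementation. It assumes a single-channel image with values in [0, 1], a per-pixel feature vector standing in for $F_s$, a query formed by averaging features over the predicted foreground shadow region, and dot-product softmax attention. The function name `attentive_fill` and all parameter names are hypothetical.

```python
import math


def attentive_fill(image, fg_mask, bg_mask, feats):
    """Illustrative sketch of attentive shadow filling (assumptions, not the paper's code).

    image   : HxW nested list of floats in [0, 1] (stand-in for I_c)
    fg_mask : HxW soft mask in [0, 1] (stand-in for the refined foreground shadow mask)
    bg_mask : HxW mask, > 0.5 marks background shadow pixels
    feats   : HxW nested list of feature vectors (stand-in for F_s)
    """
    h, w = len(image), len(image[0])
    coords = [(y, x) for y in range(h) for x in range(w)]
    fg = [(y, x) for y, x in coords if fg_mask[y][x] > 0.5]
    bg = [(y, x) for y, x in coords if bg_mask[y][x] > 0.5]
    if not fg or not bg:
        # Nothing to fill, or no reference shadow to attend to.
        return [row[:] for row in image]

    # Assumed query: mean feature over the predicted foreground shadow region.
    dim = len(feats[0][0])
    query = [sum(feats[y][x][d] for y, x in fg) / len(fg) for d in range(dim)]

    # Softmax attention restricted to background shadow pixels.
    scores = [sum(q * k for q, k in zip(query, feats[y][x])) for y, x in bg]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]

    # Target mean: attention-weighted intensity of the reference shadow pixels.
    target_mean = sum(a * image[y][x] for a, (y, x) in zip(attn, bg))

    # Scale the image so the foreground shadow region matches the target mean.
    current_mean = sum(image[y][x] for y, x in fg) / len(fg)
    scale = target_mean / max(current_mean, 1e-6)

    # Blend: mask * I_dark + (1 - mask) * I_c.
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dark = min(1.0, scale * image[y][x])
            m = fg_mask[y][x]
            row.append(m * dark + (1.0 - m) * image[y][x])
        out.append(row)
    return out
```

Because the blend uses the soft mask directly, pixels outside the predicted shadow are returned unchanged, while pixels inside it take on the darkened value. A real model would operate on color images and learned features.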