CoShadow: Multi-Object Shadow Generation for Image Compositing via Diffusion Model

Waqas Ahmed, Dean Diepeveen, Ferdous Sohel

TL;DR

This paper addresses the under-explored problem of multi-object shadow generation, aiming to synthesize physically plausible shadows for multiple inserted objects by exploiting the multimodal capabilities of a pre-trained text-to-image diffusion model.

Abstract

Realistic shadow generation is crucial for achieving seamless image compositing, yet existing methods primarily focus on single-object insertion and often fail to generalize when multiple foreground objects are composited into a background scene. In practice, however, modern compositing pipelines and real-world applications often insert multiple objects simultaneously, necessitating shadows that are jointly consistent in terms of geometry, attachment, and location. In this paper, we address the under-explored problem of multi-object shadow generation, aiming to synthesize physically plausible shadows for multiple inserted objects. Our approach exploits the multimodal capabilities of a pre-trained text-to-image diffusion model. An image pathway injects dense, multi-scale features to provide fine-grained spatial guidance, while a text-based pathway encodes per-object shadow bounding boxes as learned positional tokens and fuses them via cross-attention. An attention-alignment loss further grounds these tokens to their corresponding shadow regions. To support this task, we augment the DESOBAv2 dataset by constructing composite scenes with multiple inserted objects and automatically derive prompts combining object category and shadow positioning information. Experimental results demonstrate that our method achieves state-of-the-art performance in both single and multi-object shadow generation settings.

Paper Structure (20 sections, 7 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Architectural Overview. From a composite image $I_c$ and corresponding object masks $O_m$, a Shadow-Box Predictor estimates per-object shadow bounding boxes, which are quantized into grid bins and converted into shadow positional tokens. These positional tokens are inserted into the prompt and processed through a CLIP encoder to provide text-conditioned spatial grounding. On the image pathway, a Feature Encoder extracts multi-scale visual features and injects them into the diffusion UNet. Both conditioning streams jointly guide the diffusion model to generate the final shadowed image $I_g$.
  • Figure 2: ViP-LLaVA object naming from bounding-box prompts. Given the prompt "Name the objects in the bounding boxes.", the model generates: (Left: "Girl riding a motorbike"), (Center: "Man in blue shirt." and "Woman in white dress."), (Right: "The pole." and "Second pole from the left.").
  • Figure 3: Visual comparison with state-of-the-art baseline methods for single-object shadow generation. Our method, given the same image inputs plus a compact text prompt of category terms and shadow positional tokens, produces better shadows for all objects (e.g., Row 1: "a woman casting shadow [sx_8][sy_9][sx_3][sy_8]"). Positional tokens are inserted automatically; see Sec. \ref{sec:text_layout}.
  • Figure 4: Visual comparison with state-of-the-art baseline methods for multi-object shadow generation. Our method, given the same image inputs plus a compact text prompt of category terms and shadow positional tokens, produces better shadows for all objects (e.g., Row 1: "a girl casting shadow [sx_3][sy_8][sx_8][sy_9]; a boy casting shadow [sx_4][sy_9][sx_10][sy_11]"). Positional tokens are inserted automatically; see Sec. \ref{sec:text_layout}.
  • Figure 5: Visual comparison with state-of-the-art baseline methods on real composited images. Our method, given the same image inputs plus a compact text prompt of category terms and shadow positional tokens, produces better shadows for all objects (e.g., Row 1: "a ball casting shadow […]; a bird casting shadow […], an cone object casting shadow […]"). Positional tokens are inserted automatically; see Sec. \ref{sec:text_layout}.
  • ...and 2 more figures
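
The Figure 1 caption describes shadow bounding boxes being quantized into grid bins and turned into positional tokens that are inserted into the prompt. The following is a minimal sketch of how such a conversion might look, not the authors' implementation. The grid size (`num_bins=16`), the corner ordering of the four tokens, the pixel-coordinate box format, and the helper names `quantize`, `box_to_tokens`, and `build_prompt` are all assumptions made for illustration; only the `[sx_i][sy_j]` token syntax and the "casting shadow" prompt pattern come from the captions above.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels; assumed format


def quantize(value: float, extent: float, num_bins: int) -> int:
    """Map a coordinate in [0, extent] to a bin index in [0, num_bins - 1]."""
    if extent <= 0 or num_bins <= 0:
        raise ValueError("extent and num_bins must be positive")
    index = int(value / extent * num_bins)
    # Clamp so that value == extent (or slightly out-of-range boxes) stays valid.
    return min(max(index, 0), num_bins - 1)


def box_to_tokens(box: Box, image_size: Tuple[int, int], num_bins: int = 16) -> str:
    """Encode one shadow box as four positional tokens (hypothetical ordering)."""
    x0, y0, x1, y1 = box
    width, height = image_size
    return (
        f"[sx_{quantize(x0, width, num_bins)}]"
        f"[sy_{quantize(y0, height, num_bins)}]"
        f"[sx_{quantize(x1, width, num_bins)}]"
        f"[sy_{quantize(y1, height, num_bins)}]"
    )


def build_prompt(
    objects: Iterable[Tuple[str, Box]],
    image_size: Tuple[int, int],
    num_bins: int = 16,
) -> str:
    """Join per-object clauses of the form 'a <category> casting shadow <tokens>'."""
    clauses = [
        f"a {category} casting shadow {box_to_tokens(box, image_size, num_bins)}"
        for category, box in objects
    ]
    return "; ".join(clauses)


if __name__ == "__main__":
    # Illustrative boxes on a 512x512 image; the values are made up for this sketch.
    scene = [
        ("girl", (96, 256, 256, 288)),
        ("boy", (128, 288, 320, 352)),
    ]
    print(build_prompt(scene, (512, 512)))
```

With the illustrative boxes above, the script prints a prompt in the same form as the Figure 4 example: `a girl casting shadow [sx_3][sy_8][sx_8][sy_9]; a boy casting shadow [sx_4][sy_9][sx_10][sy_11]`. In the described method the boxes would come from the Shadow-Box Predictor rather than being supplied by hand.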