DeS3: Adaptive Attention-driven Self and Soft Shadow Removal using ViT Similarity

Yeying Jin, Wei Ye, Wenhan Yang, Yuan Yuan, Robby T. Tan

TL;DR

DeS3 removes hard, soft, and self shadows from a single image without relying on shadow masks. It combines a diffusion-based reverse sampling framework with adaptive, classifier-driven attention and a ViT similarity loss derived from a pre-trained DINO-ViT, which preserves object and scene structure during deshadowing. The method introduces conditional DDIM, CAM-guided attention, and a ViT similarity stopping criterion, and it requires no masks during training or testing. It reports state-of-the-art results on the SRD, AISTD, LRSS, USR, and UIUC datasets, including a 16% improvement in whole-image RMSE over the previous SOTA method on LRSS, and it generalizes across hard, soft, and self shadows, which makes it applicable to shadow removal in photography and computer vision pipelines.

Abstract

Removing soft and self shadows that lack clear boundaries from a single image is still challenging. Self shadows are shadows that are cast on the object itself. Most existing methods rely on binary shadow masks, without considering the ambiguous boundaries of soft and self shadows. In this paper, we present DeS3, a method that removes hard, soft, and self shadows based on adaptive attention and ViT similarity. Our novel ViT similarity loss utilizes features extracted from a pre-trained Vision Transformer. This loss helps guide the reverse sampling towards recovering scene structures. Our adaptive attention is able to differentiate shadow regions from the underlying objects, as well as shadow regions from the object casting the shadow. This capability enables DeS3 to better recover the structures of objects even when they are partially occluded by shadows. Unlike existing methods that rely on constraints during the training phase, we incorporate the ViT similarity during the sampling stage. Our method outperforms state-of-the-art methods on the SRD, AISTD, LRSS, USR, and UIUC datasets, removing hard, soft, and self shadows robustly. Specifically, on the LRSS dataset our method improves on the SOTA method by 16% in whole-image RMSE. Our data and code are available at: https://github.com/jinyeying/DeS3_Deshadow
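
As a rough illustration of how a ViT-based structure loss can compare a shadow input with a sampled image, the sketch below measures the distance between the cosine self-similarity maps of two sets of token keys. This is a minimal sketch under stated assumptions, not the authors' implementation: the self-similarity form, the function names `self_similarity` and `vit_similarity_loss`, and the random arrays standing in for DINO-ViT keys are all hypothetical.

```python
import numpy as np

def self_similarity(keys: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of token keys; returns shape (n, n)."""
    norms = np.linalg.norm(keys, axis=1, keepdims=True)
    unit = keys / np.clip(norms, 1e-8, None)  # guard against zero-length keys
    return unit @ unit.T

def vit_similarity_loss(keys_input: np.ndarray, keys_sample: np.ndarray) -> float:
    """Frobenius distance between the two self-similarity maps (assumed loss form)."""
    diff = self_similarity(keys_input) - self_similarity(keys_sample)
    return float(np.linalg.norm(diff, ord="fro"))

# Random stand-ins for keys; a real pipeline would extract them from a ViT layer.
rng = np.random.default_rng(0)
keys_shadow = rng.normal(size=(16, 8))   # 16 tokens, 8-dim keys (hypothetical sizes)
keys_rescaled = 0.5 * keys_shadow        # same pairwise structure, different magnitude
keys_other = rng.normal(size=(16, 8))    # unrelated structure

loss_same = vit_similarity_loss(keys_shadow, keys_rescaled)
loss_other = vit_similarity_loss(keys_shadow, keys_other)
print(f"same structure: {loss_same:.6f}, different structure: {loss_other:.6f}")
```

Because cosine similarity ignores the magnitude of each key, the loss in this sketch stays near zero when only the key scale changes and grows when the pairwise structure changes, which is the property a structure-preserving guidance term needs.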

Paper Structure (13 sections, 5 equations, 13 figures, 4 tables)

Figures (13)

  • Figure 1: The results of the SOTA supervised method wan2022sg and the weakly-supervised method liu2021from in removing (a) self shadow, (b) soft shadow, and (c) hard shadow. Our DeS3 preserves meaningful objects (duck, paper, bollard, etc.) during the reverse sampling and achieves better shadow removal results.
  • Figure 2: The architecture and the motivation of our DeS3. (1) The forward diffusion is shown in green. The reverse sampling starts from the noise map $\mathbf{x}_T$ concatenated with the conditional shadow input $\tilde{\mathbf{x}}$. Our DeS3 samples an image $\mathbf{x}_t$ at each time step $t$. (2) We inject a classifier into the noise prediction network $\bm{\epsilon}_\theta(\mathbf{x}_t, \tilde{\mathbf{x}}, \mathbf{a}_t, t)$. The adaptive attention $\mathbf{a}_t$ is progressively improved at each time step $t$. (3) To guide the reverse sampling towards outputting the object structure features, we use the ViT similarity loss $\mathcal{L}_{\rm sim}$, which uses keys extracted from the pre-trained DINO-ViT.
  • Figure 3: Shadows can be categorized into hard, soft, and self shadows. Left: Soft and self shadows. Right: Wrong binary masks caused by the ambiguous boundaries (red boxes mark incorrect masks, blue boxes mark mis-detected masks).
  • Figure 4: Adaptive attention is refined during inference.
  • Figure 5: The results of adaptive attention that enables our reverse sampling to focus on hard, soft and self shadows.
  • ...and 8 more figures
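
The reverse sampling described in the Figure 2 caption can be made concrete with the standard deterministic DDIM update (eta = 0). The sketch below is a minimal example under stated assumptions rather than the DeS3 sampler itself: `eps_pred` stands in for the output of the noise prediction network $\bm{\epsilon}_\theta(\mathbf{x}_t, \tilde{\mathbf{x}}, \mathbf{a}_t, t)$, and the function name `ddim_step` and the cumulative noise-schedule values are hypothetical.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from step t to the previous step.

    alpha_bar_* are cumulative products of the noise schedule, each in (0, 1].
    """
    # Estimate the clean image implied by the current sample and predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the (lower) noise level of the previous step.
    x_prev = np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred
    return x_prev, x0_pred

# Toy check with a perfect noise prediction: the clean image is recovered exactly.
rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 4, 4))   # stand-in for a shadow-free image
eps = rng.normal(size=x0.shape)               # the true noise
alpha_bar_t, alpha_bar_prev = 0.5, 0.8        # hypothetical schedule values
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

x_prev, x0_pred = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev)
print(np.allclose(x0_pred, x0))
```

In a conditional setting the only change to this step is where `eps_pred` comes from: the network would receive the shadow input and the attention map alongside $\mathbf{x}_t$, while the update rule itself stays the same.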