SwinShadow: Shifted Window for Ambiguous Adjacent Shadow Detection

Yonghui Wang, Shaokai Liu, Li Li, Wengang Zhou, Houqiang Li

TL;DR

SwinShadow tackles the challenging problem of detecting adjacent and especially ambiguous adjacent shadows by combining local and shifted window attention in a Swin Transformer-based encoder-decoder. The architecture introduces a Deep Supervision module to strengthen shadow features early, a Double Attention mechanism to unify local and shifted attention in decoding, and a Multi-Level Aggregation strategy to fuse multi-scale features for precise mask prediction. Empirical results on SBU, UCF, and ISTD show state-of-the-art BER performance and robust handling of adjacent shadows, with ablations confirming the contributions of DS, DA, and MLA. The work advances shadow detection by explicitly leveraging local context and surrounding cues, offering practical benefits for downstream vision tasks in complex scenes.

Abstract

Shadow detection is a fundamental and challenging task in many computer vision applications. Intuitively, most shadows come from the occlusion of light by the object itself, resulting in the object and its shadow being contiguous (referred to as the adjacent shadow in this paper). In this case, when the color of the object is similar to that of the shadow, existing methods struggle to achieve accurate detection. To address this problem, we present SwinShadow, a transformer-based architecture that fully utilizes the powerful shifted window mechanism for detecting adjacent shadows. The mechanism operates in two steps. First, it applies local self-attention within a single window, enabling the network to focus on local details. Second, it shifts the attention windows to facilitate inter-window attention, capturing a broader range of adjacent information. Together, these steps significantly improve the network's capacity to distinguish shadows from nearby objects. The whole process is divided into three parts: encoder, decoder, and feature integration. During encoding, we adopt Swin Transformer to acquire hierarchical features. During decoding, for shallow layers we propose a deep supervision (DS) module to suppress false positives and boost the representation capability of shadow features for subsequent processing, while for deep layers we leverage a double attention (DA) module that integrates local and shifted windows in one stage to achieve a larger receptive field and enhance the continuity of information. Finally, a new multi-level aggregation (MLA) mechanism fuses the decoded features for mask prediction. Extensive experiments on three shadow detection benchmark datasets, SBU, UCF, and ISTD, demonstrate that our network achieves good performance in terms of balance error rate (BER).
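The abstract reports results in BER without spelling out the formula. As a minimal sketch, assuming the definition commonly used in the shadow detection literature, BER = (1 - 0.5 * (TP / (TP + FN) + TN / (TN + FP))) * 100, the metric for a single binary mask could be computed as below. The function name `balance_error_rate` and the flat 0/1 input format are illustrative assumptions, not part of the paper; benchmark protocols may also accumulate the counts over a whole dataset rather than per image.

```python
def balance_error_rate(pred, gt):
    """Return BER in percent for two equal-length sequences of 0/1 labels.

    1 marks a shadow pixel, 0 a non-shadow pixel. Lower is better.
    (Hypothetical helper; the formula is the commonly used one, assumed here.)
    """
    tp = tn = fp = fn = 0
    for p, g in zip(pred, gt):
        if g and p:
            tp += 1          # shadow pixel correctly detected
        elif g and not p:
            fn += 1          # shadow pixel missed
        elif p:
            fp += 1          # non-shadow pixel flagged as shadow
        else:
            tn += 1          # non-shadow pixel correctly rejected
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("ground truth needs both shadow and non-shadow pixels")
    shadow_acc = tp / (tp + fn)       # accuracy on shadow pixels
    non_shadow_acc = tn / (tn + fp)   # accuracy on non-shadow pixels
    return 100.0 * (1.0 - 0.5 * (shadow_acc + non_shadow_acc))


if __name__ == "__main__":
    gt = [1, 1, 1, 1, 0, 0, 0, 0]
    pred = [1, 1, 1, 0, 0, 0, 1, 1]
    print(balance_error_rate(pred, gt))
```

Because the two class accuracies are averaged with equal weight, a small shadow region counts as much as the large non-shadow background, which is why the metric suits shadow masks where the classes are heavily imbalanced.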

Paper Structure

This paper contains 16 sections, 4 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: (a) and (b) show adjacent shadows, in which the object and its shadow are contiguous. The adjacent shadow in (a) has high contrast while the one in (b) has lower contrast; we refer to these two situations as normal adjacent shadow and ambiguous adjacent shadow, respectively.
  • Figure 2: Adjacent shadows in shadow detection. Compared to row 1, the objects in rows 2, 3, and 4 have lower contrast with their shadows. A more complex situation arises when the objects are darker than their shadows. Lacking attention on the objects and their adjacent regions therefore leads to mistaken shadow detection results.
  • Figure 3: (a) Local window. Region R is a part of the black object, and region A would give a strong indication that R is a shadow. (b) Shifted window. After shifting the window, region R can acquire more adjacent information (B, C, and D), which gives R a different cue: R and C (another part of the black object) share the same properties, so R is a non-shadow region.
  • Figure 4: Schematic illustration of the proposed SwinShadow. We first split an image into fixed-size patches and linearly embed each of them. Then we feed the sequence of vectors to a Swin Transformer encoder (Liu et al., 2021). To obtain accurate shadow detection results, we use the deep supervision (DS) module to process patch partition features and the double attention (DA) module to process high-level features. Finally, a multi-level feature aggregation (MLA) mechanism is applied to fuse the features, and we use these features for the final output.
  • Figure 5: Deep supervision module. First, image features from the patch partition module of the Swin Transformer encoder (Liu et al., 2021) are processed by two convolutions followed by an upsampling operation to obtain a predicted map, to which we add supervision. The predicted map is then passed through a sigmoid activation function to obtain an attention map. Finally, we use this attention map to enhance our features and acquire the deep supervision (DS) feature through a skip connection.
  • ...and 7 more figures
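
The local and shifted windows described for Figure 3 can be illustrated with a small sketch. It only tracks which positions share an attention window before and after a cyclic shift; it does not compute attention, and it ignores the attention mask that Swin Transformer applies to positions that wrap around the image border. The helper names `window_ids` and `window_mates`, the 8x8 grid, the window size of 4, and the shift of 2 are all illustrative assumptions rather than the paper's settings.

```python
def window_ids(height, width, window, shift=0):
    """Assign every grid position the id of the attention window it falls in.

    With shift > 0 the window grid is displaced cyclically by `shift`
    positions along both axes, mimicking the shifted-window step.
    (Hypothetical helper; assumes height and width are multiples of window.)
    """
    cols = width // window  # number of windows per row
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            ys = (y - shift) % height  # coordinate in the shifted frame
            xs = (x - shift) % width
            row.append((ys // window) * cols + xs // window)
        grid.append(row)
    return grid


def window_mates(grid, y, x):
    """Return the set of positions that share a window with (y, x)."""
    target = grid[y][x]
    return {
        (j, i)
        for j, row in enumerate(grid)
        for i, wid in enumerate(row)
        if wid == target and (j, i) != (y, x)
    }


if __name__ == "__main__":
    local = window_ids(8, 8, window=4)
    shifted = window_ids(8, 8, window=4, shift=2)
    r = (3, 3)  # a position at the corner of its local window, like region R
    print(sorted({local[j][i] for j, i in window_mates(local, *r)}))
    print(sorted({local[j][i] for j, i in window_mates(shifted, *r)}))
```

In this toy grid, the position at (3, 3) attends only to its own local window before the shift, while after the shift its window straddles all four local windows. This is the effect the caption attributes to region R gaining cues from regions B, C, and D.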