Timeline and Boundary Guided Diffusion Network for Video Shadow Detection

Haipeng Zhou, Hongqiu Wang, Tian Ye, Zhaohu Xing, Jun Ma, Ping Li, Qiong Wang, Lei Zhu

TL;DR

The paper addresses video shadow detection with TBGDiff, a diffusion-based framework guided jointly by past-future temporal information and shadow boundary information. It contributes three components: Dual Scale Aggregation (DSA), which fuses short-term and long-term temporal cues; Shadow Boundary Aware Attention (SBAA), which exploits edge context; and Space-Time Encoded Embedding (STEE), which conditions the diffusion model on timeline guidance. Experiments on the ViSha dataset show state-of-the-art performance, and ablation studies support the contribution of each component. The code, weights, and results are publicly released.

Abstract

Video Shadow Detection (VSD) aims to detect shadow masks across a frame sequence. Existing works suffer from inefficient temporal learning. Moreover, few works address the VSD problem by considering the defining characteristic of shadows, namely their boundaries. Motivated by this, we propose a Timeline and Boundary Guided Diffusion (TBGDiff) network for VSD that jointly takes into account past-future temporal guidance and boundary information. In detail, we design a Dual Scale Aggregation (DSA) module for better temporal understanding by rethinking the affinity between the long-term and short-term frames of the clipped video. Next, we introduce Shadow Boundary Aware Attention (SBAA), which uses edge contexts to capture the characteristics of shadows. Moreover, we are the first to introduce the Diffusion model for VSD, for which we explore a Space-Time Encoded Embedding (STEE) that injects temporal guidance into the Diffusion process to conduct shadow detection. Benefiting from these designs, our model captures not only the temporal information but also the shadow property. Extensive experiments show that our approach outperforms the state-of-the-art methods, verifying the effectiveness of our components. We release the codes, weights, and results at https://github.com/haipengzhou856/TBGDiff.

Paper Structure (33 sections, 12 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Workflow of TBGDiff. An encoder $E$ first extracts features from all frames, and the DSA module aggregates these features temporally. An Auxiliary Head decodes the DSA outputs into pseudo masks and boundary masks. For a given frame in the sequence, SBAA further explores the shadow boundary context given the boundary mask $\hat{b}_{T}$, the pseudo mask $\dot{y}_{T}$, and the aggregated feature $F^{DSA}_{T}$. The tokens produced by SBAA and the timeline guidance generated by GE are then used by the Diffusion model to conduct video shadow detection.
  • Figure 2: Illustration of SBAA. Integrating $\hat{b}_{T}$ and $F_{T}^{DSA}$ yields boundary-aware embedded tokens, which serve as the query. The pseudo mask weights the coarse shadow regions: it is multiplied element-wise with these tokens to produce the key and value. An attention mechanism followed by an FFN then outputs boundary-aware, shadow-oriented features.
  • Figure 3: Three ways to produce guidance for conditional Diffusion. (a) PCE simply concatenates the predicted masks to the current features as temporal guidance. (b) PEE uses the encoded embedding of past frames as guidance, which is more robust. (c) STEE encodes the pseudo-mask and image pairs from both past and future frames to guide the Diffusion.
  • Figure 4: Visual comparisons with state-of-the-art methods. Our predicted masks contain less noise and have boundaries that align more accurately with the shadows. See more comparison results in our Supplementary Material.
  • Figure 5: Grad-CAM (Selvaraju et al., 2017) visualization of the readout when conducting (b) long-term and (c) short-term aggregation.
  • ...and 5 more figures
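
The attention step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sbaa_attention`, the `(1 + b)` scaling used to fuse the boundary mask into the features, the single attention head, and the omission of the learned projections and the FFN are all hypothetical simplifications. The caption only states that the boundary mask and the DSA feature are "integrated" into query tokens, and that the key and value come from those tokens weighted element-wise by the pseudo mask.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sbaa_attention(features, boundary, pseudo_mask):
    """Single-head sketch of boundary-aware attention.

    features:    N tokens, each a list of C floats (stands in for F^DSA_T)
    boundary:    N values in [0, 1] (stands in for the boundary mask)
    pseudo_mask: N values in [0, 1] (stands in for the pseudo mask)
    Returns (output tokens, attention weights).
    """
    n, c = len(features), len(features[0])

    # Query: boundary-aware tokens. ASSUMPTION: fusion is a (1 + b) scaling;
    # the caption does not specify how the two inputs are integrated.
    query = [[f * (1.0 + boundary[i]) for f in features[i]] for i in range(n)]

    # Key and value: the same tokens weighted element-wise by the pseudo mask,
    # so tokens outside the coarse shadow region contribute less.
    key = [[t * pseudo_mask[i] for t in query[i]] for i in range(n)]
    value = key  # no learned projections in this sketch

    # Scaled dot-product attention.
    scale = 1.0 / math.sqrt(c)
    weights = []
    for i in range(n):
        scores = [scale * sum(q * k for q, k in zip(query[i], key[j]))
                  for j in range(n)]
        weights.append(softmax(scores))

    output = [[sum(weights[i][j] * value[j][d] for j in range(n))
               for d in range(c)]
              for i in range(n)]
    return output, weights


# Toy usage: three tokens with two channels each.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = sbaa_attention(feats, boundary=[0.0, 1.0, 0.0],
                           pseudo_mask=[1.0, 1.0, 0.0])
```

In this sketch a token whose pseudo-mask value is zero produces a zero key and a zero value, so it adds nothing to the output. That mirrors the caption's description of the pseudo mask weighting the coarse shadow regions. A faithful version would add learned query, key, and value projections and the FFN, as in the released code.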