Video Instance Shadow Detection Under the Sun and Sky

Zhenghao Xing, Tianyu Wang, Xiaowei Hu, Haoran Wu, Chi-Wing Fu, Pheng-Ann Heng

TL;DR

ViShadow is a semi-supervised video instance shadow detection framework that trains on both labeled image data and unlabeled video data. It is demonstrated through video-level applications such as video inpainting, instance cloning, shadow editing, and text-instructed shadow-object manipulation.

Abstract

Instance shadow detection, crucial for applications such as photo editing and light direction estimation, has undergone significant advancements in predicting shadow instances, object instances, and their associations. Extending this task to videos is challenging for two reasons: diverse video data is costly to annotate, and shadow-object associations are complicated by occlusion and temporary disappearances. In response to these challenges, we introduce ViShadow, a semi-supervised video instance shadow detection (VISD) framework that leverages both labeled image data and unlabeled video data for training. ViShadow features a two-stage training pipeline. The first stage uses labeled image data to identify shadow and object instances, with contrastive learning for cross-frame pairing. The second stage uses unlabeled videos and incorporates an associated cycle consistency loss to enhance tracking ability. A retrieval mechanism is introduced to manage temporary disappearances, ensuring tracking continuity. The SOBA-VID dataset, comprising unlabeled training videos and labeled testing videos, is introduced along with the SOAP-VID metric for the quantitative evaluation of VISD solutions. The effectiveness of ViShadow is further demonstrated through various video-level applications such as video inpainting, instance cloning, shadow editing, and text-instructed shadow-object manipulation.
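
To make the idea of a cycle consistency loss on unlabeled video more concrete, here is a minimal sketch in plain Python. It is a generic forward-backward formulation over per-instance embeddings in two frames, not the paper's associated cycle consistency loss, whose exact form is not given in this summary. The function names, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def affinity(src, dst, temperature=0.1):
    """Row i is a soft assignment of src[i] over all embeddings in dst.

    Similarity is a plain dot product scaled by a temperature
    (hypothetical choice; the paper may use a different similarity).
    """
    return [
        softmax([sum(a * b for a, b in zip(s, d)) / temperature for d in dst])
        for s in src
    ]


def cycle_consistency_loss(emb_t, emb_t1, temperature=0.1):
    """Penalize instances that do not return to themselves after the
    round trip frame t -> frame t+1 -> frame t. No labels are needed."""
    fwd = affinity(emb_t, emb_t1, temperature)
    bwd = affinity(emb_t1, emb_t, temperature)
    loss = 0.0
    for i in range(len(emb_t)):
        # Probability of landing back on instance i after the round trip.
        p_return = sum(fwd[i][k] * bwd[k][i] for k in range(len(emb_t1)))
        loss -= math.log(p_return + 1e-12)
    return loss / len(emb_t)


# Toy embeddings for two instances in frame t (assumed values).
frame_t = [[1.0, 0.0], [0.0, 1.0]]
# Frame t+1 with distinct embeddings: the round trip is nearly certain.
frame_t1_distinct = [[0.9, 0.1], [0.1, 0.9]]
# Frame t+1 with indistinguishable embeddings: the round trip is a coin flip.
frame_t1_ambiguous = [[0.7, 0.7], [0.7, 0.7]]

loss_distinct = cycle_consistency_loss(frame_t, frame_t1_distinct)
loss_ambiguous = cycle_consistency_loss(frame_t, frame_t1_ambiguous)
```

In this sketch, embeddings that stay distinguishable across frames yield a loss near zero, while ambiguous embeddings yield a loss of about log 2, so minimizing the loss pushes the embeddings toward ones that can be tracked without any video annotations.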

Paper Structure (32 sections, 5 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: Based on the video clip presented in (a), our ViShadow framework demonstrates a robust ability to detect, segment, associate, and track the dancer along with her shadow (b). This capability facilitates a range of applications, such as duplicating the dancer with her shadow to create captivating visual effects (c).
  • Figure 2: Visual comparison of results produced by our method and SSIS+Mask2Former on typical scenarios. Each row displays three frames from a video clip. Detected shadows and objects that are associated are marked in the same color. The combination of SSIS and Mask2Former is limited in its ability to track objects of undefined categories and objects or shadows that move out of view.
  • Figure 3: The schematic of our semi-supervised video instance shadow detection framework, ViShadow. The top stage involves supervised learning from labeled images, while the bottom stage employs self-supervised learning from unlabeled videos. NMS denotes non-maximum suppression.
  • Figure 4: The schematic illustration of the proposed bidirectional retrieving mechanism.
  • Figure 5: Example sequences demonstrating the application of video instance shadow detection to video inpainting.
  • ...and 6 more figures