Disentangling spatio-temporal knowledge for weakly supervised object detection and segmentation in surgical video

Guiqiu Liao, Matjaz Jogan, Sai Koushik, Eric Eaton, Daniel A. Hashimoto

TL;DR

This paper introduces Video Spatio-Temporal Disentanglement Networks (VDST-Net), a framework that disentangles complex spatio-temporal object interactions using semi-decoupled knowledge distillation to predict high-quality class activation maps (CAMs).

Abstract

Weakly supervised video object segmentation (WSVOS) enables the identification of segmentation maps without requiring an extensive training dataset of object masks, relying instead on coarse video labels indicating object presence. Current state-of-the-art methods either require multiple independent stages of processing that employ motion cues or, in the case of end-to-end trainable networks, lack segmentation accuracy, in part because segmentation maps are difficult to learn from videos with transient object presence. This limits the application of WSVOS to semantic annotation of surgical videos, where multiple surgical tools frequently move in and out of the field of view, a problem that is more difficult than those typically encountered in WSVOS. This paper introduces Video Spatio-Temporal Disentanglement Networks (VDST-Net), a framework to disentangle spatio-temporal information using semi-decoupled knowledge distillation to predict high-quality class activation maps (CAMs). A teacher network, designed to resolve temporal conflicts when specifics about object location and timing in the video are not provided, works with a student network that integrates information over time by leveraging temporal dependencies. We demonstrate the efficacy of our framework on a public reference dataset and on a more challenging surgical video dataset where objects are, on average, present in less than 60% of annotated frames. Our method outperforms state-of-the-art techniques and generates superior segmentation masks under video-level weak supervision.
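
To make the teacher-student arrangement in the abstract more concrete, the following is a minimal sketch of a generic distillation objective for a student network: a video-level presence loss plus a term that pulls the student's activation maps toward the teacher's. It is an illustration under assumptions, not the paper's loss. The function names, the L2 distillation term, and the weight `alpha` are all hypothetical, and the sketch does not attempt to reproduce what the authors mean by "semi-decoupled".

```python
import numpy as np


def bce(prob, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob)))


def student_loss(student_prob, student_maps, teacher_maps, presence, alpha=1.0):
    """Video-level presence loss plus a map-matching distillation term.

    student_prob : (C,) predicted per-class presence probabilities for the video
    student_maps : (T, C, H, W) student activation maps
    teacher_maps : (T, C, H, W) teacher activation maps, used as a fixed target
    presence     : (C,) ground-truth video-level presence labels (0 or 1)
    alpha        : hypothetical weight on the distillation term
    """
    # Assumption: a plain mean-squared error between maps stands in for
    # whatever distillation term the paper actually uses.
    distill = float(np.mean((student_maps - teacher_maps) ** 2))
    return bce(student_prob, presence) + alpha * distill
```

When the student's maps already match the teacher's, the second term vanishes and only the presence loss remains, so in this sketch `alpha` controls how strongly the student is tied to the teacher.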

Paper Structure (23 sections, 4 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Three types of weakly supervised detection/segmentation in video. Type I has image-level presence labels available; research on this task usually builds on weakly supervised semantic segmentation (WSSS), adding temporal constraints to improve prediction. Type II has only video-level labels, and the object is assumed to be present in most frames of the video (this is the scenario for the YouTube-Objects dataset). Type III is the most challenging: the label only indicates the presence of objects for the whole video, yet each object may be present only temporarily (i.e., in only a subset of frames).
  • Figure 2: Left: Our approach deploys knowledge distillation to disentangle spatial and temporal information for weakly supervised learning with video-level labels. Ground truth video presence labels $P_g$ provide supervision to both teacher and student, while knowledge in the activation map $M^t$ is transferred from teacher to student. Right: Ranked spatial and temporal pooling captures information about multiple objects in a frame while filtering out spurious information from frames where a target object is missing.
  • Figure 3: Activation maps of surgical video clips from different methods (MCTformer, TCAM, and our VDST-Net).
  • Figure 4: Segmentation and detection performance on the YouTube-Objects dataset. The second and third columns show the activation maps and post-processed binary masks of our method. The results in the last column are taken from belharbi2023tcam.
  • Figure 5: Qualitative results of ablation study and final segmentation masks.
  • ...and 1 more figure
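
The ranked spatial and temporal pooling mentioned in the Figure 2 caption can be illustrated with a short sketch. The captions do not specify the pooling operator, so this example assumes top-k mean pooling, one common way to rank and pool activations: the k strongest spatial responses are averaged per frame, then the k highest-scoring frames are averaged per class. The function names and the parameters `k_space` and `k_time` are hypothetical.

```python
import numpy as np


def topk_mean(x, k, axis):
    """Mean of the k largest values of x along the given axis."""
    sorted_desc = -np.sort(-x, axis=axis)                 # sort in descending order
    top = np.take(sorted_desc, np.arange(k), axis=axis)   # keep the first k entries
    return top.mean(axis=axis)


def ranked_video_scores(maps, k_space, k_time):
    """Pool (T, C, H, W) activation maps into (C,) video-level class scores.

    Assumption: top-k mean pooling over space, then over time. The paper's
    actual ranked pooling may differ.
    """
    t, c, h, w = maps.shape
    flat = maps.reshape(t, c, h * w)
    frame_scores = topk_mean(flat, k_space, axis=2)   # (T, C): per-frame class scores
    return topk_mean(frame_scores, k_time, axis=0)    # (C,): video-level class scores


# Usage: class 0 is active in a single frame out of four; class 1 never appears.
maps = np.zeros((4, 2, 3, 3))
maps[1, 0, 0, 0] = 1.0
scores = ranked_video_scores(maps, k_space=1, k_time=1)
```

Keeping only the top-ranked frames means that frames where the object is absent contribute nothing to the video-level score, whereas averaging over all frames would dilute the response of an object that is visible only briefly.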