SSFam: Scribble Supervised Salient Object Detection Family

Zhengyi Liu, Sheng Deng, Xinrui Wang, Linbo Wang, Xianyong Fang, Bin Tang

TL;DR

This work proposes SSFam, a SAM-based family of scribble supervised salient object detection (SSSOD) models for inputs that combine different modalities. It performs strongly across modality combinations, sets a new state of the art among scribble supervised methods, and comes close to fully supervised methods.

Abstract

Scribble supervised salient object detection (SSSOD) learns to segment attractive objects from their surroundings under the supervision of sparse scribble labels. For better segmentation in complex scenes, depth and thermal infrared modalities supplement RGB images. Existing methods design separate feature extraction and multi-modal fusion strategies for RGB, RGB-Depth, RGB-Thermal, and Visual-Depth-Thermal image inputs, leading to a flood of similar models. Because the recently proposed Segment Anything Model (SAM) possesses extraordinary segmentation and interactive prompting capability, we propose an SSSOD family based on SAM, named SSFam, for inputs that combine different modalities. First, modal-aware modulators are designed to capture modal-specific knowledge, which cooperates with the modal-agnostic information extracted from the frozen SAM encoder for a better feature ensemble. Second, a siamese decoder is tailored to bridge the gap between training with scribble prompts and testing with no prompt, for stronger decoding ability. Our model demonstrates remarkable performance across combinations of different modalities, sets a new state of the art among scribble supervised methods, and comes close to fully supervised methods. https://github.com/liuzywen/SSFam
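
To make "supervision of sparse scribble labels" concrete, the sketch below shows a partial binary cross-entropy, a loss commonly used in scribble supervised segmentation that averages only over annotated pixels. The abstract does not say which loss SSFam uses, so this is an illustrative assumption rather than the authors' objective; the label encoding (`1`, `0`, `255`) and the function name `partial_bce` are hypothetical.

```python
import numpy as np

# Hypothetical label encoding for a scribble map (not taken from the paper).
FOREGROUND, BACKGROUND, UNLABELED = 1, 0, 255


def partial_bce(pred, scribble, eps=1e-7):
    """Binary cross-entropy averaged over scribbled pixels only.

    pred:     float array of saliency probabilities in [0, 1]
    scribble: int array of the same shape with FOREGROUND, BACKGROUND,
              or UNLABELED at each pixel
    """
    mask = scribble != UNLABELED          # pixels that carry supervision
    if not mask.any():
        return 0.0                        # no scribbles, no loss signal
    p = np.clip(pred[mask], eps, 1 - eps)  # avoid log(0)
    y = scribble[mask].astype(np.float64)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(loss.mean())


# Toy 2x2 example: one foreground scribble pixel, one background scribble
# pixel, and two unlabeled pixels that do not contribute to the loss.
pred = np.array([[0.5, 0.9],
                 [0.5, 0.1]])
scribble = np.array([[FOREGROUND, UNLABELED],
                     [BACKGROUND, UNLABELED]])
loss = partial_bce(pred, scribble)
```

Because unlabeled pixels are masked out, changing the prediction at those positions leaves the loss unchanged, which is why scribble supervised methods typically need extra mechanisms to recover full object extent.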

Paper Structure

This paper contains 24 sections, 12 equations, 7 figures, 12 tables, 1 algorithm.

Figures (7)

  • Figure 1: Scribble supervised salient object detection family for unimodal, bimodal, and trimodal images.
  • Figure 2: The proposed scribble supervised salient object detection model. In the encoding part, a shared, frozen SAM encoder extracts modal-agnostic features, and modal-aware modulators extract modal-specific ones. The two kinds of features are aggregated by an element-wise scaled sum in each block. Finally, the features from the two modalities are summed. In the decoding part, a siamese decoder, consisting of a decoder with prompts and a decoder with no prompt, transfers the parameters learned with prompts during training to the decoder with no prompt used in testing. At test time, only the decoder with no prompt is needed.
  • Figure 3: Comparison of overall performance among RGB, RGB-D, and RGB-T scribble supervised salient object detection methods. No V-D-T SSSOD methods are compared because we are the first to introduce scribble supervision to V-D-T SOD.
  • Figure 4: Visual comparisons with RGB SSSOD competitors in some challenging cases: reflective object (1st row), fine-grained object (2nd row), multiple objects (3rd-4th rows), and small objects (5th-6th rows).
  • Figure 5: Visual comparisons with RGB-D SSSOD competitors in some challenging cases: small objects (1st-2nd rows), multiple objects (3rd-4th rows), hollow objects (5th-6th rows), and light interference (7th-8th rows).
  • ...and 2 more figures
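
The encoding path described in the Figure 2 caption can be sketched roughly as follows. This is a toy NumPy illustration, not the authors' implementation: the frozen SAM encoder is replaced by fixed random matrices, the modulator is assumed to be a small bottleneck layer, and all names and sizes (`DIM`, `NUM_BLOCKS`, `SCALE`, `make_modulator`, `encode`, `fuse`) are hypothetical. Only the overall pattern comes from the caption: shared frozen blocks, a per-modality modulator, an element-wise scaled sum in each block, and a final sum over modalities.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8          # hypothetical token dimension
NUM_BLOCKS = 3   # hypothetical number of encoder blocks
SCALE = 0.5      # hypothetical scaling factor for the modulator branch

# Stand-in for the shared, frozen encoder: one fixed matrix per block,
# reused unchanged by every modality.
frozen_blocks = [rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)
                 for _ in range(NUM_BLOCKS)]


def make_modulator(dim, bottleneck=2):
    """Create one modal-aware modulator (assumed here to be a bottleneck MLP)."""
    down = rng.standard_normal((dim, bottleneck)) / np.sqrt(dim)
    up = rng.standard_normal((bottleneck, dim)) / np.sqrt(bottleneck)
    return down, up


def encode(tokens, modulators, scale=SCALE):
    """Run tokens through the frozen blocks, adding a scaled modulator term per block."""
    x = tokens
    for w, (down, up) in zip(frozen_blocks, modulators):
        agnostic = np.tanh(x @ w)                  # modal-agnostic (frozen) branch
        specific = np.maximum(x @ down, 0.0) @ up  # modal-specific (trainable) branch
        x = agnostic + scale * specific            # element-wise scaled sum
    return x


def fuse(features_per_modality):
    """Sum the encoded features of all input modalities."""
    return np.sum(features_per_modality, axis=0)


# Bimodal example (e.g. RGB plus depth): each modality gets its own modulators
# but shares the frozen blocks.
rgb_tokens = rng.standard_normal((4, DIM))
depth_tokens = rng.standard_normal((4, DIM))
rgb_mod = [make_modulator(DIM) for _ in range(NUM_BLOCKS)]
depth_mod = [make_modulator(DIM) for _ in range(NUM_BLOCKS)]
fused = fuse([encode(rgb_tokens, rgb_mod), encode(depth_tokens, depth_mod)])
```

Under this reading, supporting a unimodal or trimodal input would only change the length of the list passed to `fuse`, which matches the "family" framing of the paper.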