SAMSOD: Rethinking SAM Optimization for RGB-T Salient Object Detection

Zhengyi Liu, Xinrui Wang, Xianyong Fang, Zhengzheng Tu, Linbo Wang

TL;DR

The paper tackles RGB-T salient object detection by fine-tuning the Segment Anything Model (SAM) while identifying two key optimization bottlenecks: imbalanced convergence between the RGB and thermal modalities, and significant gradient differences between high- and low-activation pathways. It introduces SAMSOD, which combines unimodal supervision, gradient deconfliction, and decoupled adapters to strengthen learning of the non-dominant modality and to balance foreground-background learning. Empirical results on RGB-T benchmarks, plus generalization tests on scribble-supervised RGB-T SOD, fully supervised RGB-D SOD, and RGB-D rail surface defect detection, show state-of-the-art performance and robust cross-domain transfer, with a lightweight variant for real-time use. The approach enables more reliable multi-modal segmentation under challenging conditions and provides a scalable framework for SAM-based RGB-D and RGB-T tasks, supported by detailed ablations and cost analyses.

Abstract

RGB-T salient object detection (SOD) aims to segment attractive objects by combining RGB and thermal infrared images. To enhance performance, the Segment Anything Model has been fine-tuned for this task. However, the imbalanced convergence of the two modalities and the significant gradient difference between high- and low-activations have been ignored, leaving room for further performance improvement. In this paper, we propose a model called SAMSOD, which uses unimodal supervision to enhance the learning of the non-dominant modality and employs gradient deconfliction to reduce the impact of conflicting gradients on model convergence. The method also leverages two decoupled adapters to separately mask high- and low-activation neurons, emphasizing foreground objects by enhancing background learning. Fundamental experiments on RGB-T SOD benchmark datasets, together with generalizability experiments on scribble-supervised RGB-T SOD, fully supervised RGB-D SOD, and fully supervised RGB-D rail surface defect detection, all demonstrate the effectiveness of our proposed method.
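
The abstract does not spell out the deconfliction rule, so the following is only a minimal sketch of one common way to handle conflicting gradients: a PCGrad-style projection, swapped in here as an assumption and not taken from the paper. Two gradients are treated as conflicting when their cosine similarity is negative, and the conflicting component is then projected out. The function names and the toy vectors (`g_fused`, `g_thermal`) are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equally sized gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two flattened gradient vectors."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0


def deconflict(g, g_ref):
    """PCGrad-style rule (an assumption, not the paper's stated method).

    If g points against g_ref (negative dot product), subtract from g its
    component along g_ref. Otherwise return g unchanged.
    """
    d = dot(g, g_ref)
    if d >= 0:
        return list(g)
    scale = d / dot(g_ref, g_ref)
    return [x - scale * y for x, y in zip(g, g_ref)]


# Hypothetical toy gradients on shared parameters:
# one from a fused-prediction loss, one from a unimodal (thermal) loss.
g_fused = [1.0, 0.0]
g_thermal = [-1.0, 1.0]

conflict_before = cosine_similarity(g_thermal, g_fused)
g_thermal_fixed = deconflict(g_thermal, g_fused)
conflict_after = cosine_similarity(g_thermal_fixed, g_fused)
```

In this toy case the two gradients start with negative cosine similarity, and after projection the adjusted gradient is orthogonal to `g_fused`, so it no longer pushes the shared parameters against the fused objective.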

Paper Structure

This paper contains 23 sections, 23 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: The framework comparison between traditional methods and ours and their adapters.
  • Figure 2: The pipeline of the proposed SAMSOD.
  • Figure 3: Gradient magnitude ratio between RGB encoder and thermal encoder.
  • Figure 4: The cosine similarity between gradients.
  • Figure 5: High- and low-activation maps of RGB images.
  • ...and 5 more figures