RobustSAM: Segment Anything Robustly on Degraded Images

Wei-Ting Chen, Yu-Jiet Vong, Sy-Yen Kuo, Sizhuo Ma, Jian Wang

TL;DR

RobustSAM closes SAM's performance gap on degraded images by adding lightweight modules that improve robustness to degradation while preserving zero-shot capabilities. It introduces Anti-Degradation Mask Feature Generation (AMFG), Anti-Degradation Output Token Generation (AOTG), and a Robust Output Token (ROT), trained with degradation-augmented data and consistency losses that align outputs on degraded images with clean references. The Robust-Seg dataset (688K image-mask pairs across 15 synthetic degradations) supports training and evaluation, enabling robust zero-shot segmentation across diverse conditions. Experiments show that RobustSAM improves segmentation under degradation and also benefits SAM-based downstream tasks such as dehazing and deblurring, offering a practical, efficient path to real-world deployment.
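
To make the idea of consistency losses concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the consistency terms are mean-squared errors between clean-reference and degraded-image quantities, and that the segmentation term is a Dice loss. The function names, the weights `w_feat` and `w_token`, and the choice of MSE and Dice are all illustrative assumptions; the summary above only states that consistency losses and a segmentation loss are used.

```python
import numpy as np


def mse(a, b):
    """Mean-squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))


def dice_loss(pred_prob, target, eps=1e-6):
    """Soft Dice loss between predicted probabilities and a binary mask."""
    inter = np.sum(pred_prob * target)
    denom = np.sum(pred_prob) + np.sum(target)
    return float(1.0 - (2.0 * inter + eps) / (denom + eps))


def training_objective(clean_feat, robust_feat, clean_token, robust_token,
                       pred_prob, gt_mask, w_feat=1.0, w_token=1.0):
    """Hypothetical combined objective (assumed form, for illustration only).

    clean_feat / clean_token: references computed from the clear image.
    robust_feat / robust_token: refined quantities from the degraded image.
    """
    feat_consistency = mse(robust_feat, clean_feat)
    token_consistency = mse(robust_token, clean_token)
    seg = dice_loss(pred_prob, gt_mask)
    return seg + w_feat * feat_consistency + w_token * token_consistency
```

When the refined degraded features and token match the clean references and the predicted mask matches the ground truth, every term in this sketch goes to zero, which is the alignment the consistency losses are meant to encourage.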

Abstract

Segment Anything Model (SAM) has emerged as a transformative approach in image segmentation, acclaimed for its robust zero-shot segmentation capabilities and flexible prompting system. Nonetheless, its performance is challenged by images with degraded quality. Addressing this limitation, we propose the Robust Segment Anything Model (RobustSAM), which enhances SAM's performance on low-quality images while preserving its promptability and zero-shot generalization. Our method leverages the pre-trained SAM model with only marginal parameter increments and computational requirements. The additional parameters of RobustSAM can be optimized within 30 hours on eight GPUs, demonstrating its feasibility and practicality for typical research laboratories. We also introduce the Robust-Seg dataset, a collection of 688K image-mask pairs with different degradations designed to train and evaluate our model optimally. Extensive experiments across various segmentation tasks and datasets confirm RobustSAM's superior performance, especially under zero-shot conditions, underscoring its potential for extensive real-world application. Additionally, our method has been shown to effectively improve the performance of SAM-based downstream tasks such as single image dehazing and deblurring.

Paper Structure (35 sections, 8 equations, 12 figures, 15 tables)

Figures (12)

  • Figure 1: Overview of our proposed RobustSAM. RobustSAM augments the original SAM by incorporating five essential components (in purple). During training, clear images are fed through the original SAM modules (in gray) to produce features for clear scenes. Subsequently, degraded images, generated through augmentation of clear inputs, are processed by RobustSAM, yielding features for degraded scenarios. These are then refined via Anti-degradation modules, ensuring consistency with features from clear scenes. This methodology, supported by a segmentation loss, achieves precise segmentation outcomes in both clear and degraded image conditions. During inference, only RobustSAM is used to predict a segmentation mask from an input image. Note: The prompt encoder is excluded for conciseness, and the padlock icons represent fixed components loaded from the original SAM model that are not updated during training.
  • Figure 2: Overview of the proposed Anti-degradation Mask Feature Generation (AMFG) and Anti-degradation Output Token Generation (AOTG). SEC denotes Squeeze-and-Excitation Channel attention.
  • Figure 3: Qualitative Analysis of Segmentation: A visual comparison on unseen datasets highlighting the performance improvements of RobustSAM over existing strategies.
  • Figure 4: Enhancing SAM-based applications: A qualitative demonstration of RobustSAM's superiority in refining SAM-based single image dehazing and deblurring.
  • Figure S.1: Comparison of performance, speed, and model size among various SAM and RobustSAM variants. The suffixes -B, -L, and -H correspond to ViT-B (Base), ViT-L (Large), and ViT-H (Huge) versions, respectively, representing different scales and complexities of the Vision Transformer architecture.
  • ...and 7 more figures
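
The Figure 2 caption mentions Squeeze-and-Excitation Channel attention (SEC). As a rough illustration of the general squeeze-and-excitation idea, the NumPy sketch below pools each channel to a scalar, passes the result through a two-layer bottleneck, and rescales the channels with sigmoid gates. It is not the paper's module: the function name, the weight shapes, and the reduction ratio `r` are assumptions made for this example.

```python
import numpy as np


def se_channel_attention(x, w1, w2):
    """Generic squeeze-and-excitation gating (illustrative, assumed shapes).

    x:  feature map of shape (C, H, W)
    w1: bottleneck weights of shape (C // r, C)
    w2: expansion weights of shape (C, C // r)
    Returns the rescaled feature map and the per-channel gates.
    """
    squeezed = x.mean(axis=(1, 2))                 # squeeze: global average pool, (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)        # excitation: linear + ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # linear + sigmoid, values in (0, 1)
    return x * gates[:, None, None], gates


# Example usage with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.normal(size=(C, H, W))
w1 = rng.normal(size=(C // r, C)) * 0.1
w2 = rng.normal(size=(C, C // r)) * 0.1
y, gates = se_channel_attention(x, w1, w2)
```

Each channel of `y` is the corresponding channel of `x` multiplied by a single gate value, so the block reweights channels without changing the spatial layout.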