
Segment Any-Quality Images with Generative Latent Space Enhancement

Guangqian Guo, Yong Guo, Xuehui Yu, Wenbo Li, Yaoxing Wang, Shan Gao

TL;DR

GleSAM tackles the degradation sensitivity of Segment Anything Models by embedding a pre-trained latent diffusion denoiser into SAM’s latent space to reconstruct high-quality features from low-quality inputs. It introduces two compatibility techniques, Feature Distribution Alignment and Channel Replication and Expansion, plus a two-stage training regime that preserves SAM’s generalization while enhancing latent representations. A new LQSeg dataset with diverse, multi-level degradations supports training and evaluation of robustness across unseen degradations. Across extensive experiments on seen and unseen degradations, including real-world datasets like BDD-100K, GleSAM and GleSAM2 achieve superior segmentation accuracy with minimal additional learnable parameters, demonstrating strong generalization and practical applicability in degraded scenarios.

Abstract

Despite their success, Segment Anything Models (SAMs) experience significant performance drops on severely degraded, low-quality images, limiting their effectiveness in real-world scenarios. To address this, we propose GleSAM, which utilizes Generative Latent space Enhancement to boost robustness on low-quality images, thus enabling generalization across various image qualities. Specifically, we adapt the concept of latent diffusion to SAM-based segmentation frameworks and perform the generative diffusion process in the latent space of SAM to reconstruct high-quality representation, thereby improving segmentation. Additionally, we introduce two techniques to improve compatibility between the pre-trained diffusion model and the segmentation framework. Our method can be applied to pre-trained SAM and SAM2 with only minimal additional learnable parameters, allowing for efficient optimization. We also construct the LQSeg dataset with a greater diversity of degradation types and levels for training and evaluating the model. Extensive experiments demonstrate that GleSAM significantly improves segmentation robustness on complex degradations while maintaining generalization to clear images. Furthermore, GleSAM also performs well on unseen degradations, underscoring the versatility of our approach and dataset.
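To make "degradation types and levels" concrete, here is a minimal sketch of how progressively harder low-quality inputs could be synthesized from a clear image. It mirrors only the sequence named in the Figure 1 caption (Gaussian noise, then re-sampling noise, then more severe Gaussian noise). The function names, noise strengths, and resampling factor are illustrative assumptions, not the actual LQSeg pipeline.

```python
import numpy as np


def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise (sigma on a 0-255 scale) and clip to [0, 255]."""
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0.0, 255.0)


def resample(img, factor):
    """Downsample by block averaging, then upsample by pixel repetition.

    Fine detail is lost in the round trip. Rows/columns that do not fill a
    complete block are cropped, so use sizes divisible by `factor`.
    """
    h, w = img.shape
    h2, w2 = h - h % factor, w - w % factor
    blocks = img[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor)
    small = blocks.mean(axis=(1, 3))
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)


def degrade(img, level, seed=0):
    """Apply `level` degradation stages (0 = clean, 3 = most severe).

    The stage parameters below are hypothetical values chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    out = img.astype(np.float64)
    if level >= 1:
        out = add_gaussian_noise(out, 10.0, rng)   # mild Gaussian noise
    if level >= 2:
        out = resample(out, 2)                     # re-sampling artifacts
    if level >= 3:
        out = add_gaussian_noise(out, 30.0, rng)   # more severe Gaussian noise
    return out


if __name__ == "__main__":
    clean = np.full((8, 8), 128.0)  # toy grayscale image
    for level in range(4):
        lq = degrade(clean, level)
        print(level, float(np.mean((lq - clean) ** 2)))  # MSE against the clean image
```

Pairing each `degrade(clean, level)` output with `clean` gives the kind of HQ-LQ image pair the training description refers to, under the assumptions stated above.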

Paper Structure

This paper contains 35 sections, 7 equations, 15 figures, 13 tables, 2 algorithms.

Figures (15)

  • Figure 1: Qualitative comparison on low-quality images with varying degradation levels from an unseen dataset. To generate images with different degradation levels, we progressively added Gaussian noise, re-sampling noise, and more severe Gaussian noise to an image. The results indicate that the baseline SAM shows limited robustness to degradation. Although RobustSAM retains some resilience against simpler degradations, it struggles with more complex and unfamiliar ones. In contrast, our method consistently demonstrates strong robustness across images of varying quality.
  • Figure 2: Visualization of latent features: (a) low-quality (LQ) images, (b) SAM's latent features extracted from the LQ images, which contain excessive noise that corrupts the original representations, (c) the high-quality (HQ) features of the corresponding clear images, which are more salient than the LQ ones, and (d) the representation enhanced by our GleSAM.
  • Figure 3: Given an input image, GleSAM performs accurate segmentation through image encoding, generative latent space enhancement, and mask decoding. During training, HQ-LQ image pairs are fed into the frozen image encoder to extract the corresponding HQ and LQ latent features. We then reconstruct high-quality representations in SAM's latent space by efficiently fine-tuning a generative denoising U-Net with LoRA. Subsequently, the decoder is fine-tuned with a segmentation loss to align it with the enhanced latent representations. Built upon SAMs, GleSAM inherits prompt-based segmentation and performs well on images of any quality.
  • Figure 4: Density distribution maps of IoU versus image quality for different methods, including SAM, GleSAM, SAM2, and GleSAM2. Image quality is calculated using the Laplacian operator in OpenCV. The red dashed box highlights the region where our method improves segmentation performance over SAM, particularly on lower-quality images.
  • Figure 5: Qualitative visualization of the enhanced latent features. The clearest feature is obtained when combining all modules.
  • ...and 10 more figures
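
The Figure 4 caption says image quality is computed with the Laplacian operator but does not say which statistic is taken. A common choice, assumed here, is the variance of the Laplacian response: sharp images have strong second derivatives and score high, while blurred or flat ones score low. The sketch below implements that measure in NumPy on interior pixels; the function name is hypothetical. With OpenCV the usual equivalent is `cv2.Laplacian(gray, cv2.CV_64F).var()`, which also pads the image borders, so its values can differ slightly.

```python
import numpy as np


def laplacian_variance(gray):
    """Sharpness score: variance of the 4-neighbour Laplacian over interior pixels.

    Uses the kernel [[0, 1, 0], [1, -4, 1], [0, 1, 0]]. Assumed here as the
    quality measure; the source only states that a Laplacian operator is used.
    """
    g = np.asarray(gray, dtype=np.float64)
    lap = (
        g[:-2, 1:-1] + g[2:, 1:-1]      # neighbours above and below
        + g[1:-1, :-2] + g[1:-1, 2:]    # neighbours left and right
        - 4.0 * g[1:-1, 1:-1]           # centre pixel
    )
    return float(lap.var())


if __name__ == "__main__":
    sharp = (np.indices((8, 8)).sum(axis=0) % 2) * 255.0  # checkerboard: maximal detail
    flat = np.full((8, 8), 128.0)                         # no detail at all
    print("sharp:", laplacian_variance(sharp))
    print("flat: ", laplacian_variance(flat))
```

Binning images by a score like this and plotting it against per-image IoU gives the kind of quality-versus-accuracy density map the caption describes.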