Adapting Vision Foundation Models for Robust Cloud Segmentation in Remote Sensing Images

Xuechao Zou, Shun Zhang, Kai Li, Shiying Wang, Junliang Xing, Lei Jin, Congyan Lang, Pin Tao

TL;DR

This article presents Cloud-Adapter, a parameter-efficient adaptation approach designed to improve the accuracy and robustness of cloud segmentation. It builds on a VFM pretrained on general-domain data; the VFM stays frozen, so the backbone itself needs no additional training.

Abstract

Cloud segmentation is a critical challenge in remote sensing image interpretation, as its accuracy directly impacts the effectiveness of subsequent data processing and analysis. Recently, vision foundation models (VFMs) have demonstrated powerful generalization capabilities across various visual tasks. In this paper, we present a parameter-efficient adaptive approach, termed Cloud-Adapter, designed to enhance the accuracy and robustness of cloud segmentation. Our method leverages a VFM pretrained on general-domain data, which remains frozen, so the backbone itself needs no additional training. Cloud-Adapter incorporates a lightweight spatial perception module that first uses a convolutional neural network (ConvNet) to extract dense spatial representations. These multi-scale features are then aggregated and serve as contextual inputs to an adapting module, which modulates the frozen transformer layers within the VFM. Experimental results demonstrate that Cloud-Adapter, whose trainable parameters amount to only 0.6% of the frozen backbone's parameter count, achieves substantial performance gains. Cloud-Adapter consistently achieves state-of-the-art performance across cloud segmentation datasets that span multiple satellite sources, sensor series, data processing levels, land cover scenarios, and annotation granularities. We have released the code and model checkpoints at https://xavierjiezou.github.io/Cloud-Adapter/ to support further research.
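
The parameter-efficiency figure in the abstract can be read as a ratio of trainable adapter parameters to frozen backbone parameters. The sketch below shows how such a ratio might be computed for a low-rank adapter attached to each layer of a frozen transformer stack. The helper names and all sizes (DIM, RANK, NUM_LAYERS) are illustrative assumptions rather than the paper's configuration, and the count ignores the spatial perception module, so the printed ratio is not expected to match the reported 0.6%.

```python
# Illustrative parameter budget: frozen backbone vs. trainable low-rank adapter.
# All sizes and helper names are hypothetical and NOT taken from the paper.

def linear_params(in_dim: int, out_dim: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_dim * out_dim + out_dim


def frozen_block_params(dim: int, mlp_ratio: int = 4) -> int:
    """Rough count for one transformer block (attention + MLP).

    Layer norms are ignored for simplicity.
    """
    attention = linear_params(dim, 3 * dim) + linear_params(dim, dim)
    mlp = linear_params(dim, mlp_ratio * dim) + linear_params(mlp_ratio * dim, dim)
    return attention + mlp


def low_rank_adapter_params(dim: int, rank: int) -> int:
    """Down-projection to `rank`, then up-projection back to `dim`."""
    return linear_params(dim, rank) + linear_params(rank, dim)


DIM, RANK, NUM_LAYERS = 1024, 16, 24  # hypothetical sizes

frozen_total = NUM_LAYERS * frozen_block_params(DIM)              # never updated
trainable_total = NUM_LAYERS * low_rank_adapter_params(DIM, RANK)  # updated in training
trainable_fraction = trainable_total / frozen_total

print(f"frozen: {frozen_total:,}  trainable: {trainable_total:,}  "
      f"ratio: {trainable_fraction:.2%}")
```

Because the adapter cost grows linearly with the rank while the backbone cost grows quadratically with the hidden width, a small rank keeps the trainable share well below one percent in this toy setting.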

Paper Structure

This paper contains 33 sections, 9 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Comparison between (a) previous methods and (b) our Cloud-Adapter approach. In (a), the entire network (e.g., a typical encoder-decoder architecture) is fully trainable, resulting in a large number of parameters and an increased risk of overfitting. In (b), Cloud-Adapter combines a frozen vision foundation model (VFM) with a lightweight, trainable adapter. This design preserves generalization and adaptability, enabling efficient learning for cloud segmentation tasks. The frozen VFM extracts generic visual features, while the parameter-efficient adapter facilitates effective transfer learning on diverse remote sensing scenes.
  • Figure 2: Detailed network architecture of the proposed Cloud-Adapter method, consisting of the spatial perception and adapting modules. The spatial perception module uses ConvNet blocks to extract dense spatial features, which are aggregated into a multi-scale context and fed to the adapting module. The adapting module modulates the frozen transformer layers in the VFM.
  • Figure 3: Schematic diagram of the proposed adapting module.
  • Figure 4: Ablation study of different dimension settings on the interactive context extracted by the spatial perception module.
  • Figure 5: Study of the low-rank MLP in the adapting module.
  • ...and 4 more figures
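
The data flow described for Figure 2 (spatial perception module, aggregated multi-scale context, adapting module modulating frozen layers) can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: average pooling stands in for the ConvNet blocks, the modulation is assumed to be an additive residual, the adapter's up-projection is assumed to be zero-initialized, and every class and function name is hypothetical.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols, scale=0.1):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def avg_pool(image, window):
    """Non-overlapping average pooling on a square 2D list (stand-in for a ConvNet stage)."""
    size = len(image) // window
    return [[sum(image[r * window + i][c * window + j]
                 for i in range(window) for j in range(window)) / window ** 2
             for c in range(size)] for r in range(size)]


def spatial_context(image, windows=(2, 4)):
    """Aggregate multi-scale features into one context vector (max and min per scale)."""
    context = []
    for window in windows:
        flat = [v for row in avg_pool(image, window) for v in row]
        context += [max(flat), min(flat)]
    return context


class FrozenLayer:
    """Stand-in for a frozen transformer layer: fixed weights, never updated."""

    def __init__(self, dim):
        self.weight = rand_matrix(dim, dim)

    def __call__(self, token):
        return [math.tanh(t + h) for t, h in zip(token, matvec(self.weight, token))]


class LowRankAdapter:
    """Trainable low-rank MLP mapping (token, context) -> rank -> dim."""

    def __init__(self, dim, context_dim, rank):
        self.down = rand_matrix(rank, dim + context_dim)
        # Assumed zero init: the adapter starts as a no-op on the frozen features.
        self.up = [[0.0] * rank for _ in range(dim)]

    def __call__(self, token, context):
        hidden = [max(0.0, h) for h in matvec(self.down, token + context)]  # ReLU
        return matvec(self.up, hidden)


def forward(tokens, image, layers, adapters):
    context = spatial_context(image)  # computed once, shared by all layers
    for layer, adapter in zip(layers, adapters):
        tokens = [layer(tok) for tok in tokens]  # frozen path
        # Assumed additive modulation of the frozen layer's output.
        tokens = [[t + d for t, d in zip(tok, adapter(tok, context))] for tok in tokens]
    return tokens


DIM, RANK, NUM_LAYERS = 8, 2, 3  # hypothetical toy sizes
image = [[random.random() for _ in range(8)] for _ in range(8)]
tokens = [[random.random() for _ in range(DIM)] for _ in range(4)]
layers = [FrozenLayer(DIM) for _ in range(NUM_LAYERS)]
context_dim = len(spatial_context(image))
adapters = [LowRankAdapter(DIM, context_dim, RANK) for _ in range(NUM_LAYERS)]
adapted = forward(tokens, image, layers, adapters)
```

With the zero-initialized up-projection, the adapted output initially equals the output of the frozen layers alone; training would then update only the adapter weights while the frozen layers stay fixed.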