From Pixels to Images: Deep Learning Advances in Remote Sensing Image Semantic Segmentation

Quanwei Liu, Tao Huang, Yanni Dong, Jiaqi Yang, Wei Xiang

TL;DR

The paper synthesizes deep learning advances in remote sensing image semantic segmentation (RSISS), tracing the evolution from pixel- and patch-based methods to tile- and image-level approaches and highlighting the rising importance of multimodal fusion. It introduces a unified evaluation framework built on the MOSD dataset, benchmarking nearly 40 methods to reveal trade-offs between performance and efficiency across unimodal and multimodal settings. Key contributions include a proposed taxonomy bridging patchwise and tilewise paradigms, an analysis of linear versus nonlinear fusion, and a discussion of communication-ready learning strategies (self-supervised learning, semi-supervised learning, domain adaptation, domain generalization (DG), and vision-language learning). The work underscores open challenges and suggests three future directions (data expansion, foundation-model adoption, and robust cross-domain learning) that can make RSISS more practical for large-scale, diverse Earth observation tasks.

Abstract

Remote sensing images (RSIs) capture both natural and human-induced changes on the Earth's surface, serving as essential data for environmental monitoring, urban planning, and resource management. Semantic segmentation (SS) of RSIs enables the fine-grained interpretation of surface features, making it a critical task in remote sensing analysis. With the increasing diversity and volume of RSIs collected by sensors on various platforms, traditional processing methods struggle to maintain efficiency and accuracy. In response, deep learning (DL) has emerged as a transformative approach, enabling substantial advances in remote sensing image semantic segmentation (RSISS) by automating feature extraction and improving segmentation accuracy across diverse modalities. This paper revisits the evolution of DL-based RSISS by categorizing existing approaches into four stages: the early pixel-based methods, the prevailing patch-based and tile-based techniques, and the emerging image-based strategies enabled by foundation models. We analyze these developments from the perspective of feature extraction and learning strategies, revealing the field's progression from pixel-level to tile-level and from unimodal to multimodal segmentation. Furthermore, we conduct a comprehensive evaluation of nearly 40 advanced techniques on a unified dataset to quantitatively characterize their performance and applicability. This review offers a holistic view of DL-based SS for RS, highlighting key advancements, comparative insights, and open challenges to guide future research.
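To make the difference between the pixel-, patch-, and tile-based stages more concrete, the NumPy sketch below shows how the input unit changes in each case. It is an illustration under assumptions, not the authors' pipeline: the function names, the patch size of 5, and the tile size of 4 are hypothetical. The definitions it uses are the common reading of these terms rather than text quoted from the paper: a pixel-based model sees one spectral vector, a patch-based model sees a neighborhood and predicts the label of its center pixel, and a tile-based model sees a tile and predicts a label for every pixel in it.

```python
import numpy as np


def pixel_samples(image):
    """Pixel-based: each sample is one spectral vector; output shape (H*W, C)."""
    h, w, c = image.shape
    return image.reshape(h * w, c)


def patch_samples(image, patch=5):
    """Patch-based: one (patch, patch, C) neighborhood per pixel.

    The label of each sample would be the class of its center pixel.
    Borders are handled with reflect padding (an assumption of this sketch).
    """
    assert patch % 2 == 1, "patch size must be odd so a center pixel exists"
    r = patch // 2
    h, w, c = image.shape
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    out = np.empty((h * w, patch, patch, c), dtype=image.dtype)
    for i in range(h):
        for j in range(w):
            # Window centered on original pixel (i, j).
            out[i * w + j] = padded[i:i + patch, j:j + patch]
    return out


def tile_samples(image, tile=4):
    """Tile-based: non-overlapping (tile, tile, C) crops in row-major order.

    A segmentation network would predict a dense (tile, tile) label map
    for each crop.
    """
    h, w, c = image.shape
    assert h % tile == 0 and w % tile == 0, "sketch assumes exact tiling"
    grid = image.reshape(h // tile, tile, w // tile, tile, c).swapaxes(1, 2)
    return grid.reshape(-1, tile, tile, c)
```

For an 8 × 8 image with 3 bands, these functions yield 64 spectral vectors, 64 overlapping 5 × 5 patches, and 4 non-overlapping 4 × 4 tiles. The patch-based form repeats computation across heavily overlapping neighborhoods, whereas the tile-based form labels every pixel of a crop in one pass.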

Paper Structure

This paper contains 63 sections, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Processing flow for RSISS. $\downharpoonright \upharpoonleft$ denotes the feature interaction.
  • Figure 2: Configurable architecture for DL models. DL models can be categorised into four classes following a layer–block–network–architecture framework. New model architectures rely on permutations of existing modules or the development of new base modules.
  • Figure 3: Comparison of trends in the accuracy of different segmentation strategies in this survey.
  • Figure 4: Pixel-based, patch-based, tile-based, and image-based RSISS illustrations. The patch-based and tile-based SS frameworks are attached below.
  • Figure 5: Illustration of the data processing, model, structure, supervision, fusion, and feature extraction approaches used for RSISS.
  • ...and 9 more figures