From Pixels to Images: Deep Learning Advances in Remote Sensing Image Semantic Segmentation
Quanwei Liu, Tao Huang, Yanni Dong, Jiaqi Yang, Wei Xiang
TL;DR
The paper synthesizes deep learning advances in remote sensing image semantic segmentation (RSISS), tracing the evolution from pixel- and patch-based methods to tile- and image-level approaches and highlighting the rising importance of multimodal fusion. It introduces a unified evaluation framework built on the MOSD dataset and benchmarks nearly 40 methods to reveal performance and efficiency trade-offs across unimodal and multimodal settings. Key contributions include a taxonomy bridging patchwise and tilewise paradigms, an analysis of linear versus nonlinear fusion, and a discussion of learning strategies (self-supervised learning, semi-supervised learning, domain adaptation, domain generalization, and vision-language learning). The work underscores open challenges and suggests future directions (data expansion, foundation-model adoption, and robust cross-domain learning) that can make RSISS more practical for large-scale, diverse Earth observation tasks.
Abstract
Remote sensing images (RSIs) capture both natural and human-induced changes on the Earth's surface, serving as essential data for environmental monitoring, urban planning, and resource management. Semantic segmentation (SS) of RSIs enables the fine-grained interpretation of surface features, making it a critical task in remote sensing analysis. With the increasing diversity and volume of RSIs collected by sensors on various platforms, traditional processing methods struggle to maintain efficiency and accuracy. In response, deep learning (DL) has emerged as a transformative approach, enabling substantial advances in remote sensing image semantic segmentation (RSISS) by automating feature extraction and improving segmentation accuracy across diverse modalities. This paper revisits the evolution of DL-based RSISS by categorizing existing approaches into four stages: the early pixel-based methods, the prevailing patch-based and tile-based techniques, and the emerging image-based strategies enabled by foundation models. We analyze these developments from the perspective of feature extraction and learning strategies, revealing the field's progression from pixel-level to tile-level and from unimodal to multimodal segmentation. Furthermore, we conduct a comprehensive evaluation of nearly 40 advanced techniques on a unified dataset to quantitatively characterize their performance and applicability. This review offers a holistic view of DL-based SS for RS, highlighting key advancements, comparative insights, and open challenges to guide future research.
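The four stages named in the abstract differ mainly in what unit of the image is handed to the model. The following is a minimal sketch of that distinction, assuming the common reading of the terms: a pixel-based method sees one pixel at a time, a patch-based method sees a small neighbourhood and labels its centre pixel, a tile-based method labels every pixel of a cropped tile, and an image-based method takes the whole scene. The function names, the single-band list-of-lists image, and the edge-clamping at borders are illustrative assumptions, not the paper's implementation.

```python
from typing import Iterator, List, Tuple

# Toy single-band image: H rows by W columns of integer values (an assumption
# made for brevity; real RSIs carry several spectral bands per pixel).
Image = List[List[int]]
Position = Tuple[int, int]


def pixel_inputs(image: Image) -> Iterator[Tuple[Position, int]]:
    """Pixel-based: each pixel value is an independent sample."""
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            yield (r, c), value


def patch_inputs(image: Image, size: int) -> Iterator[Tuple[Position, Image]]:
    """Patch-based: a size x size neighbourhood is used to label its centre pixel."""
    if size % 2 == 0:
        raise ValueError("size must be odd so the patch has a centre pixel")
    h, w = len(image), len(image[0])
    half = size // 2
    for r in range(h):
        for c in range(w):
            # Clamp indices at the border so edge pixels still get a full patch.
            patch = [
                [image[min(max(rr, 0), h - 1)][min(max(cc, 0), w - 1)]
                 for cc in range(c - half, c + half + 1)]
                for rr in range(r - half, r + half + 1)
            ]
            yield (r, c), patch


def tile_inputs(image: Image, tile: int) -> Iterator[Tuple[Position, Image]]:
    """Tile-based: non-overlapping crops; every pixel in a crop receives a label."""
    h, w = len(image), len(image[0])
    for r0 in range(0, h, tile):
        for c0 in range(0, w, tile):
            yield (r0, c0), [row[c0:c0 + tile] for row in image[r0:r0 + tile]]


def image_input(image: Image) -> Image:
    """Image-based: the whole scene is passed to the model in one piece."""
    return image
```

On a 4 x 4 toy image, `pixel_inputs` and `patch_inputs(image, 3)` each yield 16 samples (one per pixel), `tile_inputs(image, 2)` yields 4 tiles, and `image_input` returns the image unchanged. The drop in sample count from pixel to tile to image mirrors the shift from per-pixel classification to dense prediction that the abstract describes.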
