
Survey on Remote Sensing Scene Classification: From Traditional Methods to Large Generative AI Models

Qionghao Huang, Can Hu

Abstract

Remote sensing scene classification has experienced a paradigmatic transformation from traditional handcrafted feature methods to sophisticated artificial intelligence systems that now form the backbone of modern Earth observation applications. This comprehensive survey examines the complete methodological evolution, systematically tracing development from classical texture descriptors and machine learning classifiers through the deep learning revolution to current state-of-the-art foundation models and generative AI approaches. We chronicle the pivotal shift from manual feature engineering to automated hierarchical representation learning via convolutional neural networks, followed by advanced architectures including Vision Transformers, graph neural networks, and hybrid frameworks. The survey provides in-depth coverage of breakthrough developments in self-supervised foundation models and vision-language systems, highlighting exceptional performance in zero-shot and few-shot learning scenarios. Special emphasis is placed on generative AI innovations that tackle persistent challenges through synthetic data generation and advanced feature learning strategies. We analyze contemporary obstacles including annotation costs, multimodal data fusion complexities, interpretability demands, and ethical considerations, alongside current trends in edge computing deployment, federated learning frameworks, and sustainable AI practices. Based on comprehensive analysis of recent advances and gaps, we identify key future research priorities: advancing hyperspectral and multi-temporal analysis capabilities, developing robust cross-domain generalization methods, and establishing standardized evaluation protocols to accelerate scientific progress in remote sensing scene classification systems.

Paper Structure

This paper contains 38 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Structure of the Survey
  • Figure 2: Methodological taxonomy and developmental timeline of remote sensing scene classification. The diagram traces the evolution from traditional handcrafted feature methods (e.g., BoW, VLAD, spectral/texture descriptors) through deep learning architectures (CNNs, GNNs, Attention Mechanisms, Vision Transformers, Mamba) to large-scale pre-trained models (RSFMs, VLMs) and generative AI approaches (GANs, VAEs, Diffusion Models), as well as hybrid and real-world application strategies. Branch connections indicate how later paradigms were motivated by the limitations of earlier approaches, while timeline annotations mark the approximate periods of prominence for each category.
  • Figure 3: Scale-Free CNN architecture: (a) traditional CNN with fixed input size, (b) SF-CNN enabling arbitrary input sizes through FCL convolution and global average pooling, (c) FCL convolution process [xie2019scale].
  • Figure 4: The framework of RS-RADGNN [huang2025remote].
  • Figure 5: Vision Transformer architecture for remote sensing scene classification: (a) Overall model architecture showing patch embedding and transformer encoder stack, (b) Transformer encoder module with multi-head self-attention and MLP layers, (c) Multi-head self-attention (MSA) mechanism, (d) Individual self-attention head computation [bashmal2021deep].