AFM-Net: Advanced Fusing Hierarchical CNN Visual Priors with Global Sequence Modeling for Remote Sensing Image Scene Classification
Yuanhao Tang, Xuechao Zou, Zhengpei Hu, Junliang Xing, Chengkun Zhang, Jianqiang Huang
TL;DR
Remote sensing image scene classification (RSIC) remains challenging because scenes contain complex multi-scale structures and global-context models are computationally expensive. AFM-Net pairs a CNN-based local feature extractor with a Vision Mamba global-context backbone in a parallel dual-branch architecture, and fuses their representations through a DenseModel core and DAMF blocks. A Mixture-of-Experts classifier head adaptively routes the deeply fused features to specialized experts, improving fine-grained recognition while keeping computation efficient. On UC Merced, AID, and NWPU-RESISC45, AFM-Net achieves state-of-the-art accuracy at favorable computational cost, supporting the effectiveness of hierarchical heterogeneous fusion for RSIC. Code is publicly available, and the framework has the potential to extend to other remote sensing tasks such as object detection and semantic segmentation.
Abstract
Remote sensing image scene classification remains a challenging task, primarily due to the complex spatial structures and multi-scale characteristics of ground objects. Among existing approaches, CNNs excel at modeling local textures, while Transformers excel at capturing global context. However, integrating the two efficiently remains a bottleneck because of the high computational cost of Transformers. To tackle this, we propose AFM-Net, a novel Advanced Hierarchical Fusing framework that achieves effective local and global co-representation through two pathways: a CNN branch for extracting hierarchical visual priors, and a Mamba branch for efficient global sequence modeling. The core innovation of AFM-Net lies in its Hierarchical Fusion Mechanism, which progressively aggregates multi-scale features from both pathways, enabling dynamic cross-level feature interaction and contextual reconstruction to produce highly discriminative representations. These fused features are then adaptively routed through a Mixture-of-Experts classifier module, which dispatches them to the most suitable experts for fine-grained scene recognition. Experiments on AID, NWPU-RESISC45, and UC Merced show that AFM-Net obtains 93.72%, 95.54%, and 96.92% accuracy, respectively, surpassing state-of-the-art methods with balanced performance and efficiency. Code is available at https://github.com/tangyuanhao-qhu/AFM-Net.
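
To make the routing step concrete, the following is a minimal sketch of a Mixture-of-Experts classifier head in plain Python. It is not taken from the AFM-Net code. The top-k softmax gating, the linear experts, the random initialization, and every name in it (MoEHead, route, forward, num_experts, top_k) are assumptions chosen for illustration; the abstract states only that fused features are dispatched to the most suitable experts.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_linear(rng, out_dim, in_dim):
    """Small random weights and zero bias (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


class MoEHead:
    """Hypothetical MoE classifier head with top-k gating (an assumption)."""

    def __init__(self, in_dim, num_classes, num_experts=4, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.num_classes = num_classes
        self.gate = init_linear(rng, num_experts, in_dim)
        self.experts = [init_linear(rng, num_classes, in_dim)
                        for _ in range(num_experts)]

    def route(self, x):
        """Return {expert_index: weight} for the top-k experts; weights sum to 1."""
        gate_probs = softmax(linear(*self.gate, x))
        ranked = sorted(range(len(gate_probs)),
                        key=lambda i: gate_probs[i], reverse=True)
        top = ranked[:self.top_k]
        norm = sum(gate_probs[i] for i in top)
        return {i: gate_probs[i] / norm for i in top}

    def forward(self, x):
        """Mix the selected experts' logits, then convert to class probabilities."""
        routing = self.route(x)
        logits = [0.0] * self.num_classes
        for i, weight in routing.items():
            expert_logits = linear(*self.experts[i], x)
            logits = [acc + weight * v for acc, v in zip(logits, expert_logits)]
        return softmax(logits), routing


# Usage: a stand-in 8-dimensional fused feature, 45 classes as in NWPU-RESISC45.
head = MoEHead(in_dim=8, num_classes=45)
probs, routing = head.forward([0.5] * 8)
```

Only the selected experts are evaluated for each input, which is where a head of this kind saves computation relative to running every expert. In a trained model the gate and the experts would be learned jointly rather than randomly initialized as they are here.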
