AFM-Net: Advanced Fusing Hierarchical CNN Visual Priors with Global Sequence Modeling for Remote Sensing Image Scene Classification

Yuanhao Tang, Xuechao Zou, Zhengpei Hu, Junliang Xing, Chengkun Zhang, Jianqiang Huang

TL;DR

Remote sensing image scene classification (RSIC) remains challenging due to complex multi-scale structures and the high cost of global-context models. AFM-Net combines a CNN-based local feature extractor with a Vision Mamba global-context backbone in a parallel dual-branch architecture, and fuses their representations through a DenseModel core built from DAMF blocks. A Mixture-of-Experts classifier head adaptively routes the deeply fused features to specialized experts, improving fine-grained recognition while remaining efficient. Across UC Merced, AID, and NWPU-RESISC45, AFM-Net achieves state-of-the-art accuracy with favorable computational cost, validating the effectiveness of hierarchical heterogeneous fusion for RSIC. Code is publicly available, and the proposed framework has potential to extend to other remote sensing tasks such as object detection and semantic segmentation.

Abstract

Remote sensing image scene classification remains a challenging task, primarily due to the complex spatial structures and multi-scale characteristics of ground objects. Among existing approaches, CNNs excel at modeling local textures, while Transformers excel at capturing global context. However, efficiently integrating the two remains a bottleneck due to the high computational cost of Transformers. To tackle this, we propose AFM-Net, a novel Advanced Hierarchical Fusing framework that achieves effective local and global co-representation through two pathways: a CNN branch for extracting hierarchical visual priors, and a Mamba branch for efficient global sequence modeling. The core innovation of AFM-Net lies in its Hierarchical Fusion Mechanism, which progressively aggregates multi-scale features from both pathways, enabling dynamic cross-level feature interaction and contextual reconstruction to produce highly discriminative representations. These fused features are then adaptively routed through a Mixture-of-Experts classifier module, which dispatches them to the most suitable experts for fine-grained scene recognition. Experiments on AID, NWPU-RESISC45, and UC Merced show that AFM-Net obtains 93.72%, 95.54%, and 96.92% accuracy, respectively, surpassing state-of-the-art methods with balanced performance and efficiency. Code is available at https://github.com/tangyuanhao-qhu/AFM-Net.
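
The adaptive routing step described in the abstract can be illustrated with a small sketch. The abstract does not specify the gating function, the number of experts, or how many experts are selected per sample, so everything below is an assumption for illustration: a softmax gate, top-k selection with renormalized weights, linear experts, and the hypothetical names `moe_classify` and `random_layer`. It uses only the Python standard library and is not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Apply y = W x + b, where W is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def moe_classify(x, gate, experts, top_k=2):
    """Route feature vector x to the top_k experts chosen by the gate.

    Assumed design (not taken from the paper): softmax gate, top-k
    selection, gate weights renormalized over the selected experts.
    Returns (class_probabilities, selected_expert_indices).
    """
    gate_probs = softmax(linear(gate[0], gate[1], x))
    chosen = sorted(range(len(experts)),
                    key=lambda i: gate_probs[i], reverse=True)[:top_k]
    norm = sum(gate_probs[i] for i in chosen)

    num_classes = len(experts[0][1])
    logits = [0.0] * num_classes
    for i in chosen:
        weights, bias = experts[i]
        expert_logits = linear(weights, bias, x)
        share = gate_probs[i] / norm  # this expert's share of the output
        logits = [acc + share * val for acc, val in zip(logits, expert_logits)]
    return softmax(logits), chosen


def random_layer(rng, out_dim, in_dim):
    """Hypothetical helper: a randomly initialized linear layer."""
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


# Illustrative sizes only; the paper's actual dimensions are not given here.
rng = random.Random(0)
FEATURE_DIM, NUM_EXPERTS, NUM_CLASSES = 8, 4, 5
gate = random_layer(rng, NUM_EXPERTS, FEATURE_DIM)
experts = [random_layer(rng, NUM_CLASSES, FEATURE_DIM)
           for _ in range(NUM_EXPERTS)]
fused_feature = [rng.uniform(-1, 1) for _ in range(FEATURE_DIM)]

probs, chosen = moe_classify(fused_feature, gate, experts, top_k=2)
print("selected experts:", chosen)
print("class probabilities:", [round(p, 3) for p in probs])
```

Only the selected experts are evaluated for a given sample, which is the usual reason sparse routing of this kind can add capacity without a proportional increase in compute.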

Paper Structure

This paper contains 32 sections, 9 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Performance versus efficiency. The horizontal axis denotes GFLOPs, the vertical axis denotes overall accuracy (OA), and the marker size represents the number of parameters. AFM-Net (red star) achieves the best balance between accuracy and efficiency.
  • Figure 2: Visualizing the Effective Receptive Fields (ERF) of different architectures [2022Scaling, 2017Understanding]. Brighter regions denote greater importance for the final prediction. (a) Transformer: Exhibits a scattered, global ERF by capturing long-range dependencies. (b) CNN: Presents a highly localized ERF, constrained by its local inductive bias. (c) AFM-Net (Ours): Achieves a superior focused-yet-broad ERF. Its CNN branch provides a strong local core, while the Mamba branch efficiently captures structured global context, resulting in a more robust and comprehensive feature representation.
  • Figure 3: The AFM-Net architecture. It synergizes local and global information via a dual-branch design, comprising a CNN backbone for spatial features and a Mamba [gu2023mamba] backbone for long-range dependencies. Features from both branches are refined and progressively integrated at multiple stages by our DenseModel fusion core. A final MoE head performs adaptive classification.
  • Figure 4: Visualization of the three distinct scanning paths within the Mamba branch for sequence processing.
  • Figure 5: The architecture of our proposed DenseModel for multi-scale feature fusion. At each hierarchical stage, features from the CNN and Mamba branches are fused by a DAMF block.
  • ...and 4 more figures
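
Figure 4 refers to three scanning paths that turn the 2D grid of image patches into 1D sequences for the Mamba branch. The caption does not say which three paths are used, so the sketch below simply assumes row-major, column-major, and reversed row-major orders to show the general idea; the function names are hypothetical and the choice of paths may differ from the paper.

```python
def row_major(height, width):
    """Visit patches left to right, top to bottom."""
    return [r * width + c for r in range(height) for c in range(width)]


def column_major(height, width):
    """Visit patches top to bottom, left to right."""
    return [r * width + c for c in range(width) for r in range(height)]


def reversed_row_major(height, width):
    """Visit patches in the reverse of row-major order."""
    return row_major(height, width)[::-1]


def scan(tokens, order):
    """Reorder a flat (row-major) token list along a scanning path."""
    return [tokens[i] for i in order]


def unscan(sequence, order):
    """Invert scan(): put each token back at its original grid position."""
    restored = [None] * len(order)
    for position, index in enumerate(order):
        restored[index] = sequence[position]
    return restored


# A 2 x 3 grid of patch tokens stored in row-major order:
#   a b c
#   d e f
H, W = 2, 3
tokens = ["a", "b", "c", "d", "e", "f"]

# Assumed set of paths, for illustration only.
paths = {
    "row_major": row_major(H, W),
    "column_major": column_major(H, W),
    "reversed_row_major": reversed_row_major(H, W),
}
sequences = {name: scan(tokens, order) for name, order in paths.items()}

for name, seq in sequences.items():
    print(f"{name:>20}: {' '.join(seq)}")
```

Because each path is a permutation of patch indices, the per-path outputs of a sequence model can be mapped back to grid positions with `unscan` before they are merged.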