Balanced Diffusion-Guided Fusion for Multimodal Remote Sensing Classification

Hao Liu, Yongjie Zheng, Yuhan Kang, Mingyang Zhang, Maoguo Gong, Lorenzo Bruzzone

TL;DR

BDGF addresses modality imbalance in multimodal remote sensing classification by applying an adaptive modality masking strategy during DDPM pre-training, so that the learned diffusion features are not dominated by the spectral image. The diffusion features then guide a three-branch network (CNN, Transformer, Mamba) through hierarchical feature fusion, group channel attention, and cross-attention, while a mutual learning module aligns the probability entropy and feature similarity of the branches. The authors report state-of-the-art overall accuracy (OA), average accuracy (AA), and Kappa on the Berlin, Augsburg, Yellow River Estuary, and LCZ HK datasets, and report that the framework transfers to cross-model diffusion features. The work positions diffusion-guided, modality-balanced fusion as a promising approach to multimodal remote sensing classification.

Abstract

Deep learning-based techniques for the analysis of multimodal remote sensing data have become popular due to their ability to integrate complementary spatial, spectral, and structural information from different sensors. Recently, denoising diffusion probabilistic models (DDPMs) have attracted attention in the remote sensing community due to their ability to capture robust and complex spatial-spectral distributions. However, pre-training multimodal DDPMs may result in modality imbalance, and how to leverage diffusion features to guide the extraction of complementary and diverse features remains an open question. To address these issues, this paper proposes a balanced diffusion-guided fusion (BDGF) framework that leverages multimodal diffusion features to guide a multi-branch network for land-cover classification. Specifically, we propose an adaptive modality masking strategy that encourages the DDPMs to learn a modality-balanced data distribution rather than a spectral-image-dominated one. Subsequently, these diffusion features hierarchically guide feature extraction among CNN, Mamba, and transformer networks by integrating feature fusion, group channel attention, and cross-attention mechanisms. Finally, a mutual learning strategy is developed to enhance inter-branch collaboration by aligning the probability entropy and feature similarity of the individual subnetworks. Extensive experiments on four multimodal remote sensing datasets demonstrate that the proposed method achieves superior classification performance. The code is available at https://github.com/HaoLiu-XDU/BDGF.
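
The mutual learning idea in the abstract (aligning the probability entropy and feature similarity of the subnetworks) can be illustrated with a small sketch. This is not the authors' loss: the squared entropy gap, the `1 - cosine similarity` term, the pairwise averaging, and every function name below are assumptions chosen for illustration only. The official formulation is in the paper and the linked repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p + eps) for p in probs)


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def mutual_alignment_terms(logits_by_branch, feats_by_branch):
    """Hypothetical alignment terms averaged over all branch pairs.

    Returns (entropy_gap, feature_gap):
      entropy_gap = mean squared difference of prediction entropies
      feature_gap = mean of (1 - cosine similarity) between branch features
    Both forms are assumptions, not the paper's definitions.
    """
    names = sorted(logits_by_branch)
    if len(names) < 2:
        raise ValueError("need at least two branches to align")
    ent = {n: entropy(softmax(logits_by_branch[n])) for n in names}
    entropy_gap, feature_gap, pairs = 0.0, 0.0, 0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            entropy_gap += (ent[a] - ent[b]) ** 2
            feature_gap += 1.0 - cosine_similarity(
                feats_by_branch[a], feats_by_branch[b]
            )
            pairs += 1
    return entropy_gap / pairs, feature_gap / pairs


# Toy example: one sample, three branches, three classes, 2-D features.
logits = {"cnn": [2.0, 0.5, 0.1], "mamba": [1.5, 0.7, 0.2], "transformer": [0.3, 0.3, 0.3]}
feats = {"cnn": [1.0, 0.0], "mamba": [0.9, 0.1], "transformer": [0.0, 1.0]}
entropy_gap, feature_gap = mutual_alignment_terms(logits, feats)
```

In this toy setup both terms shrink toward zero as the branches agree, which is the behaviour an alignment objective of this kind is meant to encourage.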

Paper Structure

This paper contains 23 sections, 29 equations, 15 figures, 8 tables, 1 algorithm.

Figures (15)

  • Figure 1: Comparison of workflows. (a) Previous methods process diffusion and multimodal features separately and then combine them for joint classification. (b) In this work, we exploit global and noise-robust diffusion information to guide the mutual learning of local, sequence-level, and long-range features.
  • Figure 2: Illustration of the proposed BDGF framework.
  • Figure 3: Structure of the adaptive modality masking strategy. In the forward diffusion process, the strategy consists of adding an iteration-varying structure mask and sample mask to the spectral image, while adding noise to the multimodal data.
  • Figure 4: 2D t-SNE embeddings of diffusion feature distribution on the LCZ HK dataset.
  • Figure 5: Flowchart of the diffusion-feature-guided CNN-based network.
  • ...and 10 more figures
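
The Figure 3 caption describes masking the spectral image with an iteration-varying mask while noise is added to the multimodal data in the forward diffusion process. A minimal sketch of that idea follows, under loudly stated assumptions: the noising step is the standard DDPM forward process with a linear beta schedule, but the linear mask-ratio schedule, the single random pixel mask (standing in for both the structure mask and the sample mask, whose actual designs are not given here), and all function names are hypothetical.

```python
import math
import random


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) over the first t steps (linear schedule)."""
    abar = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        abar *= 1.0 - beta
    return abar


def mask_ratio(iteration, total_iterations, max_ratio=0.5):
    """Hypothetical schedule: the masked fraction grows linearly with training iteration."""
    return max_ratio * min(1.0, iteration / total_iterations)


def mask_pixels(image, ratio, rng):
    """Zero each pixel independently with probability `ratio` (assumed mask form)."""
    return [[0.0 if rng.random() < ratio else v for v in row] for row in image]


def add_noise(image, t, num_steps, rng):
    """Standard DDPM forward step: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    abar = alpha_bar(t, num_steps)
    signal, noise = math.sqrt(abar), math.sqrt(1.0 - abar)
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in row] for row in image]


def masked_forward_step(spectral, auxiliary, t, num_steps, iteration, total_iterations, rng):
    """Mask only the spectral image, then noise both modalities at timestep t."""
    ratio = mask_ratio(iteration, total_iterations)
    masked_spectral = mask_pixels(spectral, ratio, rng)
    return (
        add_noise(masked_spectral, t, num_steps, rng),
        add_noise(auxiliary, t, num_steps, rng),
    )


# Toy example: 2x2 single-band patches for a spectral and an auxiliary modality.
rng = random.Random(0)
spectral = [[0.2, 0.4], [0.6, 0.8]]
auxiliary = [[1.0, 0.0], [0.5, 0.5]]
noisy_spectral, noisy_auxiliary = masked_forward_step(
    spectral, auxiliary, t=500, num_steps=1000,
    iteration=2000, total_iterations=10000, rng=rng,
)
```

Masking only the spectral input is one plausible way to keep the denoiser from relying on the dominant modality; how BDGF actually adapts the masks is specified in the paper itself.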