Frequency-Aware Vision-Language Multimodality Generalization Network for Remote Sensing Image Classification

Junjie Zhang; Feng Zhao; Hanqiang Liu; Jun Yu

TL;DR

Remote sensing (RS) image classification across diverse modalities and unseen scenes suffers from cross-domain heterogeneity. We formalize RS multimodality generalization (RSMG) and introduce FVMGN, a network trained to minimize the target error ${\rm error}_{\mathcal{T}} = \min_{f}\sum_{k=1}^{K}\mathbb{E}_{(x,y)\in T_k}[\mathcal{L}(f(x),y)]$. FVMGN integrates diffusion-based data augmentation (DTAug), frequency-domain disentanglement (MWDis), spatial-frequency feature extraction (SFIE), and multiscale cross-modal alignment (MSFFA) with modality-specific textual priors. The approach combines a transformer-based text encoder for shared and proprietary class texts, wavelet-enabled vision pathways, and a multiscale alignment objective that unifies spatial and frequency representations. Experiments on MUUFL, Trento, and HU2013 show that FVMGN achieves state-of-the-art cross-domain generalization for RS multimodal data and that each architectural component contributes to the performance gains.
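As a rough illustration of the quantity inside the objective above, the sketch below estimates $\sum_{k=1}^{K}\mathbb{E}_{(x,y)\in T_k}[\mathcal{L}(f(x),y)]$ for a fixed classifier by averaging a per-sample loss within each domain and summing over domains. The names `target_error` and `cross_entropy`, the choice of cross-entropy as $\mathcal{L}$, and the toy data are assumptions made for illustration; they are not taken from the paper.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label (assumed choice of loss L)."""
    return -math.log(probs[label])


def target_error(model, target_domains, loss=cross_entropy):
    """Sum over domains T_k of the mean per-sample loss of `model`.

    `model` maps an input x to a list of class probabilities.
    `target_domains` is a list of K domains, each a list of (x, y) pairs.
    """
    total = 0.0
    for samples in target_domains:
        losses = [loss(model(x), y) for x, y in samples]
        total += sum(losses) / len(losses)  # empirical expectation over T_k
    return total


# Hypothetical toy setup: two domains, two classes, a classifier that
# always predicts a uniform distribution.
def uniform_model(x):
    return [0.5, 0.5]


toy_domains = [[(0.1, 0), (0.7, 1)], [(0.4, 1)]]
print(target_error(uniform_model, toy_domains))
```

The outer $\min_{f}$ in the objective corresponds to choosing the classifier that makes this sum as small as possible; the sketch only evaluates the sum for one given classifier.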

Abstract

Rapid advances in remote sensing (RS) technology are giving rise to a novel multimodality generalization task, which requires a model to overcome data heterogeneity while retaining strong cross-scene generalization ability. Moreover, most vision-language models (VLMs) describe surface materials in RS images using universal texts, and lack proprietary linguistic prior knowledge specific to different RS vision modalities. In this work, we formalize RS multimodality generalization (RSMG) as a learning paradigm and propose a frequency-aware vision-language multimodality generalization network (FVMGN) for RS image classification. Specifically, a diffusion-based training-test-time augmentation (DTAug) strategy is designed to reconstruct multimodal land-cover distributions, enriching the input information for FVMGN. Next, to overcome multimodal heterogeneity, a multimodal wavelet disentanglement (MWDis) module is developed to learn cross-domain invariant features by resampling low- and high-frequency components in the frequency domain. Considering the characteristics of RS vision modalities, shared and proprietary class texts are designed as linguistic inputs for the transformer-based text encoder to extract diverse text features. For multimodal vision inputs, a spatial-frequency-aware image encoder (SFIE) is constructed to realize local-global feature reconstruction and representation. Finally, a multiscale spatial-frequency feature alignment (MSFFA) module is proposed to construct a unified semantic space, ensuring refined multiscale alignment of different text and vision features in the spatial and frequency domains. Extensive experiments show that FVMGN achieves excellent multimodality generalization ability compared with state-of-the-art (SOTA) methods.
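To make the phrase "low and high frequency components" concrete, the following sketch splits a small 2D image into one low-frequency sub-band and three high-frequency detail sub-bands with a single-level Haar wavelet transform, then reconstructs the image from them. This is only the generic decomposition step: the abstract does not state which wavelet family MWDis uses or how the components are resampled, so the Haar choice and the helper names `haar_split` and `haar_merge` are assumptions for illustration.

```python
def haar_split(image):
    """Single-level 2D Haar transform of a 2D list with even height and width.

    Returns (low, (d_row, d_col, d_diag)): one low-frequency sub-band and
    three high-frequency detail sub-bands, each half the input size.
    """
    rows, cols = len(image), len(image[0])
    assert rows % 2 == 0 and cols % 2 == 0, "even dimensions required"
    low, d_row, d_col, d_diag = [], [], [], []
    for i in range(0, rows, 2):
        lo, dr, dc, dg = [], [], [], []
        for j in range(0, cols, 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            lo.append((a + b + c + d) / 2)  # local average (low frequency)
            dr.append((a + b - c - d) / 2)  # top row minus bottom row
            dc.append((a - b + c - d) / 2)  # left column minus right column
            dg.append((a - b - c + d) / 2)  # diagonal difference
        low.append(lo)
        d_row.append(dr)
        d_col.append(dc)
        d_diag.append(dg)
    return low, (d_row, d_col, d_diag)


def haar_merge(low, details):
    """Inverse of haar_split: rebuild the image from its four sub-bands."""
    d_row, d_col, d_diag = details
    image = []
    for i in range(len(low)):
        top, bottom = [], []
        for j in range(len(low[0])):
            s, r, c, g = low[i][j], d_row[i][j], d_col[i][j], d_diag[i][j]
            top += [(s + r + c + g) / 2, (s + r - c - g) / 2]
            bottom += [(s - r + c - g) / 2, (s - r - c + g) / 2]
        image += [top, bottom]
    return image


# Toy 2x2 "image": the transform is invertible, so merging recovers it.
low, details = haar_split([[1, 2], [3, 4]])
restored = haar_merge(low, details)
```

Because the split is invertible, a module can in principle modify or resample the low- and high-frequency sub-bands separately and still map the result back to the spatial domain.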
