FusDreamer: Label-efficient Remote Sensing World Model for Multimodal Data Classification

Jinping Wang, Weiwei Song, Hao Chen, Jinchang Ren, Huimin Zhao

TL;DR

FusDreamer introduces a label-efficient remote sensing world model that unifies hyperspectral, LiDAR, and text data within a latent diffusion-based framework. The architecture combines a latent diffusion fusion and multimodal generation (LaMG) module, an open-world knowledge-guided consistency projection (OK-CP) module, and a multitask combinatorial optimization (MuCO) strategy to align modalities and leverage open-world prompts for classification with scarce labels. Experiments on four RS datasets indicate that FusDreamer is effective, particularly when training data are limited. The approach offers a path for integrating multimodal RS data and vision-language knowledge in monitoring and analysis tasks.

Abstract

World models significantly enhance hierarchical understanding, improving data integration and learning efficiency. To explore the potential of the world model in the remote sensing (RS) field, this paper proposes a label-efficient remote sensing world model for multimodal data fusion (FusDreamer). The FusDreamer uses the world model as a unified representation container to abstract common and high-level knowledge, promoting interactions across different types of data, i.e., hyperspectral (HSI), light detection and ranging (LiDAR), and text data. Initially, a new latent diffusion fusion and multimodal generation paradigm (LaMG) is utilized for its exceptional information integration and detail retention capabilities. Subsequently, an open-world knowledge-guided consistency projection (OK-CP) module incorporates prompt representations for visually described objects and aligns language-visual features through contrastive learning. In this way, the domain gap can be bridged by fine-tuning the pre-trained world models with limited samples. Finally, an end-to-end multitask combinatorial optimization (MuCO) strategy can capture slight feature bias and constrain the diffusion process in a collaboratively learnable direction. Experiments conducted on four typical datasets indicate the effectiveness and advantages of the proposed FusDreamer. The corresponding code will be released at https://github.com/Cimy-wang/FusDreamer.
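
As a rough illustration of the language-visual alignment through contrastive learning mentioned above, the following minimal sketch computes a symmetric, CLIP-style contrastive loss on paired embeddings. It is not the authors' implementation: the function name `symmetric_contrastive_loss`, the temperature of 0.07, the embedding sizes, and the use of NumPy are all assumptions made for illustration.

```python
import numpy as np


def symmetric_contrastive_loss(visual, text, temperature=0.07):
    """Hypothetical helper: symmetric InfoNCE loss over N paired embeddings.

    visual, text: arrays of shape (N, D); row i of each is a matched pair.
    """
    # L2-normalise so the dot product is a cosine similarity.
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)

    # (N, N) similarity matrix; entry (i, j) compares visual i with text j.
    logits = (v @ t.T) / temperature

    def diagonal_cross_entropy(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i (the matched pair).
        return -np.mean(np.diag(log_probs))

    # Average the visual-to-text and text-to-visual directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))


# Toy data (assumed): 8 pairs of 16-dimensional embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(8, 16))
visual_emb = text_emb + 0.01 * rng.normal(size=(8, 16))  # nearly aligned
mismatched = np.roll(visual_emb, 1, axis=0)              # pairs broken

loss_aligned = symmetric_contrastive_loss(visual_emb, text_emb)
loss_mismatched = symmetric_contrastive_loss(mismatched, text_emb)
print(loss_aligned < loss_mismatched)
```

The loss is small when each visual embedding is closest to its own text prompt and grows when the pairing is broken, which is the property a contrastive alignment objective relies on.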

Paper Structure

This paper contains 39 sections, 24 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: The architecture of the proposed FusDreamer network, offering a unified representation container for latent feature generation and multimodal feature interaction. The LaMG module uses an interactive latent diffusion process to generate learnable latent multimodal features. Then, the OK-CP module generates semantic-aware open-world prompts from physical knowledge attributes; the pre-trained knowledge behind these prompts is well suited to training on multimodal data with few labeled samples. Finally, the MuCO module constrains the data generation by considering the open-world prompts.
  • Figure 2: An illustration of the multimodal data feature learning process.
  • Figure 3: Classification maps obtained by different networks on the Houston 2018 dataset. (a) Ground Truth, (b) MAHiDFNet, (c) AM$^3$Net, (d) NNCNet, (e) CALC, (f) MBFormer, (g) DSHFNet, and (h) FusDreamer.
  • Figure 4: Classification maps obtained by different networks on the MUUFL dataset. (a) Ground Truth, (b) MAHiDFNet, (c) AM$^3$Net, (d) NNCNet, (e) CALC, (f) MBFormer, (g) DSHFNet, and (h) FusDreamer.
  • Figure 5: Classification maps obtained by different networks on the Trento dataset. (a) Ground Truth, (b) MAHiDFNet, (c) AM$^3$Net, (d) NNCNet, (e) CALC, (f) MBFormer, (g) DSHFNet, and (h) FusDreamer.
  • ...and 1 more figure