
Multimodal Informative ViT: Information Aggregation and Distribution for Hyperspectral and LiDAR Classification

Jiaqing Zhang, Jie Lei, Weiying Xie, Geng Yang, Daixun Li, Yunsong Li

TL;DR

This work tackles redundancy in multimodal land cover classification by introducing MIViT, a Multimodal Informative ViT that aggregates and distributes information across modalities. It pairs an Alignment Encoder and Oriented Attention Fusion with a Transformer-based global feature extractor, augmented by a Mutual Information–based Information Aggregation Constraint (IAC) and a Self-Distillation–driven Information Distribution Flow (IDF) to learn compact, performance-aware representations. The method supports missing-modality scenarios with lightweight independent classifiers and a reconstruction decoder, achieving strong results across multiple datasets. On Houston2013, MUUFL, and Trento, MIViT yields state-of-the-art accuracy (average OA around 95.6%), demonstrating robust generalization and practical impact for MLCC.

Abstract

In multimodal land cover classification (MLCC), a common challenge is redundancy in the data distribution: irrelevant information from multiple modalities can hinder the effective integration of their unique features. To tackle this, we introduce the Multimodal Informative ViT (MIViT), a system with an innovative information aggregate-distributing mechanism. This approach redefines redundancy levels and integrates performance-aware elements into the fused representation, facilitating the learning of semantics in both forward and backward directions. MIViT significantly reduces redundancy in the empirical distribution of each modality's separate and fused features. It employs oriented attention fusion (OAF) to extract shallow local features across modalities along the horizontal and vertical dimensions, and a Transformer feature extractor to extract deep global features through long-range attention. We also propose an information aggregation constraint (IAC) based on mutual information, designed to remove redundant information and preserve complementary information within the embedded features. Additionally, the information distribution flow (IDF) in MIViT enhances performance-awareness by distributing global classification information across the feature maps of the different modalities. The architecture also addresses missing-modality challenges with lightweight independent modality classifiers, reducing the computational load typically associated with Transformers. Our results show that MIViT's bidirectional aggregate-distributing mechanism between modalities is highly effective, achieving an average overall accuracy of 95.56% across three multimodal datasets. This performance surpasses current state-of-the-art methods in MLCC. The code for MIViT is accessible at https://github.com/icey-zhang/MIViT.
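
To make the notion of redundancy measured by mutual information concrete, the following is a minimal sketch, not the paper's IAC. It uses a plain histogram estimator on synthetic 1-D features; the function name `mutual_information`, the bin count, and the variables `hsi_feat`, `lidar_redundant`, and `lidar_independent` are assumptions introduced only for illustration. A feature that largely repeats another has high mutual information with it, while an independent feature has mutual information near zero.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Plug-in histogram estimate of I(X; Y) in nats for two 1-D samples."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                # joint probability table
    px = pxy.sum(axis=1, keepdims=True)      # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)      # marginal of y, shape (1, bins)
    nz = pxy > 0                             # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# Synthetic stand-ins for per-pixel features (hypothetical, not real data).
rng = np.random.default_rng(0)
hsi_feat = rng.normal(size=5000)
lidar_redundant = hsi_feat + 0.1 * rng.normal(size=5000)  # nearly a copy
lidar_independent = rng.normal(size=5000)                 # unrelated signal

mi_redundant = mutual_information(hsi_feat, lidar_redundant)
mi_independent = mutual_information(hsi_feat, lidar_independent)
print(f"redundant pair:   {mi_redundant:.3f} nats")
print(f"independent pair: {mi_independent:.3f} nats")
```

A constraint in the spirit of the IAC would penalize the first situation and tolerate the second. In practice, mutual information between high-dimensional feature maps is usually estimated with learned bounds rather than histograms, so this estimator only illustrates the quantity being discussed.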

Paper Structure (23 sections, 26 equations, 15 figures, 6 tables, 1 algorithm)


Figures (15)

  • Figure 1: Existing MLCC methods vs. our MIViT.
  • Figure 2: The detailed framework of the proposed method. The different modalities are first fed into an Alignment Encoder (AE) to capture the aligned separated features ($\Phi_T^1$ and $\Phi_T^2$) from shallow to deep levels. The separated features are then fused by Oriented Attention Fusion (OAF) to model the complementary information of each modality. To explicitly encourage complementary learning, eliminate information redundancy, and enhance the performance-awareness of the multiple classifiers, both the Information Aggregation Constraint (IAC) and the Information Distribution Flow (IDF) are imposed on the separated and fused representations.
  • Figure 3: The structure of the oriented attention fusion (OAF) module.
  • Figure 4: Illustration of complementary information preservation and redundant information elimination.
  • Figure 5: Visualization of multimodal redundancy in an existing MLCC method and in our MIViT.
  • ...and 10 more figures
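
The TL;DR describes the IDF as self-distillation driven, and the Figure 2 caption says it is imposed on both the separated and the fused representations. This page does not give the actual loss, so the following is only a generic temperature-scaled distillation sketch, under the assumption that the fused classifier acts as teacher and each modality classifier as student. The names `softmax` and `distillation_loss`, the temperature value, and all logits are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean KL(teacher || student) on softened predictions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)   # teacher: fused classifier
    q = softmax(student_logits, temperature)   # student: modality classifier
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(kl.mean() * temperature ** 2)

# Hypothetical logits for two pixels and three land-cover classes.
fused_logits = np.array([[4.0, 1.0, 0.0], [0.5, 3.0, 0.2]])
hsi_logits = np.array([[3.5, 1.2, 0.1], [0.4, 2.8, 0.3]])    # agrees with teacher
lidar_logits = np.array([[0.2, 2.5, 1.0], [2.0, 0.5, 1.5]])  # disagrees

loss_hsi = distillation_loss(fused_logits, hsi_logits)
loss_lidar = distillation_loss(fused_logits, lidar_logits)
print(f"HSI student loss:   {loss_hsi:.4f}")
print(f"LiDAR student loss: {loss_lidar:.4f}")
```

Minimizing such a loss pulls each lightweight modality classifier toward the fused prediction, which is one plausible way a single-modality branch could remain usable when the other modality is missing.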