Optimal Transport Adapter Tuning for Bridging Modality Gaps in Few-Shot Remote Sensing Scene Classification

Zhong Ji, Ci Liu, Jingren Liu, Chen Tang, Yanwei Pang, Xuelong Li

TL;DR

This work tackles few-shot remote sensing scene classification by addressing a modality gap between rich visual data and sparse textual cues. It introduces Optimal Transport Adapter Tuning (OTAT), which uses Optimal Transport to create Platonic representations that enable efficient cross-modal information transfer via a novel Optimal Transport Adapter (OTA) and an entropy-aware, sample-level loss (EAW). The approach leverages a frozen CLIP backbone with lightweight adapters and OT-based optimization (via Sinkhorn) to align image and text distributions, augmented by dynamic prototypes and a cosine-based alignment objective. Empirical results on UC Merced, WHU-RS19, NWPU-RESISC45, and AID show state-of-the-art performance in few-shot settings and strong cross-dataset generalization, often surpassing full fine-tuning while remaining computationally efficient. The work introduces a principled pathway for multimodal representation learning in remote sensing with practical benefits for data-scarce scenarios.
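
The Sinkhorn-based alignment mentioned above can be illustrated with a minimal sketch. It assumes a cosine cost between image and text embeddings, uniform marginals, and hypothetical values for the regularization strength `eps` and the iteration count; none of these settings are taken from the paper, and the function names are illustrative only.

```python
import numpy as np

def cosine_cost(img_feats, txt_feats):
    """Cost matrix C[i, j] = 1 - cos(img_i, txt_j); entries lie in [0, 2]."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    return 1.0 - img @ txt.T

def sinkhorn(cost, a, b, eps=0.5, n_iters=500):
    """Entropy-regularized OT plan via Sinkhorn scaling (eps, n_iters are assumed values)."""
    K = np.exp(-cost / eps)            # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                # rescale rows toward marginal a
        v = b / (K.T @ u)              # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

# Hypothetical data: 8 image embeddings and 4 class-text embeddings of width 16.
rng = np.random.default_rng(0)
img_feats = rng.normal(size=(8, 16))
txt_feats = rng.normal(size=(4, 16))
a = np.full(8, 1 / 8)                  # uniform mass over images (assumption)
b = np.full(4, 1 / 4)                  # uniform mass over text prompts (assumption)

plan = sinkhorn(cosine_cost(img_feats, txt_feats), a, b)
```

Each entry `plan[i, j]` is the mass moved from image `i` to text embedding `j`; rows sum to `a` and columns to `b` once the iterations have converged. A smaller `eps` yields a sharper plan but usually calls for log-domain updates to stay numerically stable.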

Abstract

Few-Shot Remote Sensing Scene Classification (FS-RSSC) presents the challenge of classifying remote sensing images with limited labeled samples. Existing methods typically emphasize single-modal feature learning, neglecting the potential benefits of optimizing multi-modal representations. To address this limitation, we propose a novel Optimal Transport Adapter Tuning (OTAT) framework aimed at constructing an ideal Platonic representational space through optimal transport (OT) theory. This framework seeks to harmonize rich visual information with less dense textual cues, enabling effective cross-modal information transfer and complementarity. Central to this approach is the Optimal Transport Adapter (OTA), which employs a cross-modal attention mechanism to enrich textual representations and facilitate better subsequent information interaction. By transforming the network optimization into an OT optimization problem, OTA establishes efficient pathways for balanced information exchange between modalities. Moreover, we introduce a sample-level Entropy-Aware Weighted (EAW) loss, which combines difficulty-weighted similarity scores with entropy-based regularization. This loss function provides finer control over the OT optimization process, enhancing its solvability and stability. Our framework offers a scalable and efficient solution for advancing multimodal learning in remote sensing applications. Extensive experiments on benchmark datasets demonstrate that OTAT achieves state-of-the-art performance in FS-RSSC, significantly improving model performance and generalization.
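
For orientation, the textbook entropy-regularized OT problem that this kind of formulation builds on can be written as follows. The symbols are assumptions introduced for illustration (cost matrix $C$, marginals $a$ and $b$, regularization strength $\varepsilon$); this is the standard form, not necessarily the exact objective or the EAW loss defined in the paper.

```latex
\[
T^{\star} = \arg\min_{T \in U(a,b)} \; \langle T, C \rangle - \varepsilon H(T),
\qquad
U(a,b) = \bigl\{ T \in \mathbb{R}_{+}^{n \times m} : T\mathbf{1}_m = a,\; T^{\top}\mathbf{1}_n = b \bigr\},
\]
\[
H(T) = -\sum_{i,j} T_{ij}\,\bigl(\log T_{ij} - 1\bigr),
\qquad
T^{\star} = \operatorname{diag}(u)\, K \,\operatorname{diag}(v),
\quad K = \exp(-C/\varepsilon).
\]
```

The entropy term makes the problem strictly convex, so the optimal plan is unique and has the scaling form on the right, with positive vectors $u$ and $v$ fixed by the two marginal constraints.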

Paper Structure

This paper contains 32 sections, 30 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Illustration of our motivation. In the FS-RSSC task, the disparity in information density between the visual and textual modalities causes traditional optimization algorithms to focus disproportionately on the visual space, resulting in a feature space heavily biased towards visual information. The limited visual data further restricts representation learning, hindering the construction of a robust shared latent space, as illustrated in Figure (a). In contrast, our OTAT framework leverages OT theory to guide the complementation of visual and textual information, thereby creating a unified space that approximates Platonic representations for precise alignment, as shown in Figure (b).
  • Figure 2: Illustration of the proposed Optimal Transport Adapter Tuning (OTAT) framework, comprising: (1) A frozen CLIP model and a trainable OTA structure for multimodal feature extraction; (2) OTO, which leverages visual knowledge to augment textual information and optimizes information transfer between the two modalities; and (3) EAW, integrating adaptive weight adjustment and entropy regularization to derive the optimal OT solver.
  • Figure 3: Implementation details of the OTA structure. In the image encoder, adapter layers are placed parallel to both MHSA and FFN blocks. In the text encoder, adapter layers are positioned parallel to MHSA blocks, with a cross-modal attention mechanism added parallel to the FFN block.
  • Figure 4: The UMAP visualization of generated features under different configurations: (a) Baseline, where the original adapters are optimized with standard cross-entropy loss, (b) OT Optimization (OTO), applying our OT optimization to the original adapters, (c) OTA + OTO, integrating OT optimization into our OTA structure, and (d) OTA + OTO + EAW, further incorporating EAW loss. Dashed lines highlight regions with noticeable class mixing.
  • Figure 5: Impact of module combinations on cross-modal alignment. The MNN metric is used to measure alignment over training.
  • ...and 2 more figures
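
The adapter placement described in the Figure 3 caption can be sketched roughly as below. This is a simplified illustration under several assumptions that do not come from the paper: bottleneck adapters with a zero-initialized up-projection, a single-head cross-attention in which text tokens query image tokens, a toy `frozen_block` standing in for CLIP's frozen MHSA and FFN sub-layers, and no layer normalization. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4  # hypothetical embedding width and adapter bottleneck width

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_adapter():
    # Zero-initialized up-projection: the adapter branch starts as a no-op (assumption).
    return {"down": rng.normal(scale=0.02, size=(d, r)), "up": np.zeros((r, d))}

def adapter(x, p):
    # Bottleneck adapter: down-project, ReLU, up-project.
    return np.maximum(x @ p["down"], 0.0) @ p["up"]

def frozen_block(x, W):
    # Toy stand-in for a frozen MHSA or FFN sub-layer.
    return np.tanh(x @ W)

def cross_attention(txt, img, Wq, Wk, Wv):
    # Text tokens act as queries; image tokens supply keys and values.
    q, k, v = txt @ Wq, img @ Wk, img @ Wv
    return softmax(q @ k.T / np.sqrt(d)) @ v

def image_layer(x, W_mhsa, W_ffn, ad_mhsa, ad_ffn):
    x = x + frozen_block(x, W_mhsa) + adapter(x, ad_mhsa)  # adapter parallel to MHSA
    x = x + frozen_block(x, W_ffn) + adapter(x, ad_ffn)    # adapter parallel to FFN
    return x

def text_layer(t, img, W_mhsa, W_ffn, ad_mhsa, Wq, Wk, Wv):
    t = t + frozen_block(t, W_mhsa) + adapter(t, ad_mhsa)                 # adapter parallel to MHSA
    t = t + frozen_block(t, W_ffn) + cross_attention(t, img, Wq, Wk, Wv)  # cross-attention parallel to FFN
    return t

# Usage with hypothetical token counts: 5 image tokens, 3 text tokens.
W1, W2 = rng.normal(scale=0.1, size=(2, d, d))
Wq, Wk, Wv = rng.normal(scale=0.1, size=(3, d, d))
img_tokens = rng.normal(size=(5, d))
txt_tokens = rng.normal(size=(3, d))
img_out = image_layer(img_tokens, W1, W2, make_adapter(), make_adapter())
txt_out = text_layer(txt_tokens, img_out, W1, W2, make_adapter(), Wq, Wk, Wv)
```

Because each trainable branch is added in parallel to a frozen sub-layer, the frozen path is left untouched; with the zero-initialized up-projection assumed here, the image layer initially reproduces the frozen computation exactly.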