
Asymmetric Cross-Modal Knowledge Distillation: Bridging Modalities with Weak Semantic Consistency

Riling Wei, Kelu Yao, Chuanguang Yang, Jin Wang, Zhuoyan Gao, Chao Li

TL;DR

This work introduces Asymmetric Cross-modal Knowledge Distillation (ACKD) to enable knowledge transfer between modalities with weak semantic overlap, addressing the limitations of traditional Symmetric Cross-modal KD (SCKD) in scenarios with unpaired data. The authors propose SemBridge, a plug-in framework comprising Student-Friendly Matching (SFM), which reduces transport costs via dynamic teacher selection, and Semantic-aware Knowledge Alignment (SKA), which optimizes intra- and cross-modal transport with an optimal transport planner and CORAL-based alignment. They formalize transport costs with the Wasserstein distance and intra-modal planning, implement retrieval galleries and a self-supervised semantic matcher, and show that SemBridge achieves state-of-the-art performance across six model architectures on three remote-sensing datasets, while also improving existing SCKD methods under the ACKD setting. The benchmark dataset and method target scalable cross-modal remote sensing applications, though the training speed of the SFM component remains a noted limitation.
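As an illustration of what "CORAL-based alignment" usually means, the sketch below implements the standard Deep CORAL loss: the squared Frobenius distance between the feature covariance matrices of two batches, scaled by 1/(4d²). It is a minimal, dependency-free sketch assuming plain lists of feature vectors; the helper names `covariance` and `coral_loss` are hypothetical, and SemBridge's exact CORAL term inside SKA may be formulated differently.

```python
from typing import List

Matrix = List[List[float]]  # rows are samples, columns are feature dimensions


def covariance(features: Matrix) -> Matrix:
    """Unbiased d x d covariance of an n x d feature batch (requires n >= 2)."""
    n = len(features)
    d = len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def coral_loss(student_feats: Matrix, teacher_feats: Matrix) -> float:
    """Deep CORAL loss: ||C_s - C_t||_F^2 / (4 d^2).

    Hypothetical helper for illustration. Only second-order statistics are
    compared, so the two batches need not be paired or even the same size.
    """
    d = len(student_feats[0])
    if len(teacher_feats[0]) != d:
        raise ValueError("student and teacher features must share dimension d")
    cs = covariance(student_feats)
    ct = covariance(teacher_feats)
    sq_frob = sum((cs[i][j] - ct[i][j]) ** 2 for i in range(d) for j in range(d))
    return sq_frob / (4 * d * d)
```

Because the loss compares covariances rather than individual samples, it is translation-invariant and works with unequal batch sizes, which is one reason covariance alignment fits a setting without paired teacher and student samples.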

Abstract

Cross-modal Knowledge Distillation has demonstrated promising performance on paired modalities with strong semantic connections, referred to as Symmetric Cross-modal Knowledge Distillation (SCKD). However, implementing SCKD becomes exceedingly constrained in real-world scenarios due to the limited availability of paired modalities. To this end, we investigate a general and effective knowledge learning concept under weak semantic consistency, dubbed Asymmetric Cross-modal Knowledge Distillation (ACKD), aiming to bridge modalities with limited semantic overlap. Nevertheless, the shift from strong to weak semantic consistency improves flexibility but exacerbates challenges in knowledge transmission costs, which we rigorously verify using optimal transport theory. To mitigate the issue, we further propose a framework, namely SemBridge, integrating a Student-Friendly Matching module and a Semantic-aware Knowledge Alignment module. The former leverages self-supervised learning to acquire semantic-based knowledge and provide personalized instruction for each student sample by dynamically selecting the relevant teacher samples. The latter seeks the optimal transport path by employing Lagrangian optimization. To facilitate the research, we curate a benchmark dataset derived from two modalities, namely Multi-Spectral (MS) and asymmetric RGB images, tailored for remote sensing scene classification. Comprehensive experiments show that our framework achieves state-of-the-art performance compared with 7 existing approaches on 6 different model architectures across various datasets.
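The idea of "dynamically selecting the relevant teacher samples" for each student sample can be pictured as a nearest-neighbour lookup in a shared embedding space. The sketch below is a minimal illustration under two assumptions not stated in the abstract: relevance is measured by cosine similarity, and each student sample keeps its top-k teacher candidates from a precomputed gallery. The function `select_teachers` and the parameter `k` are hypothetical and do not reproduce SemBridge's matcher.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def select_teachers(
    student_emb: List[Vector], teacher_gallery: List[Vector], k: int
) -> List[List[int]]:
    """Hypothetical top-k matcher: for each student embedding, return the
    indices of the k most similar teacher gallery entries (ties keep
    gallery order because Python's sort is stable)."""
    selections = []
    for s in student_emb:
        scores = [cosine(s, t) for t in teacher_gallery]
        ranked = sorted(
            range(len(teacher_gallery)), key=lambda i: scores[i], reverse=True
        )
        selections.append(ranked[:k])
    return selections
```

To make the selection dynamic, one would recompute `student_emb` with the current student network at each training step, so the chosen teachers change as the student's representation evolves.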

Paper Structure

This paper contains 25 sections, 34 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: (a) SCKD distills knowledge between modalities from the same location, assuming strict semantic alignment. In contrast, ACKD relaxes this constraint, enabling cross-modal transfer with only weak semantic consistency, regardless of location. This allows a small MS dataset to benefit a larger RGB set. (b) The proposed SemBridge further boosts the performance of SCKD approaches (DKD, RKD, Vanilla KD, LSKD, CTKD, and Logits) under ACKD settings.
  • Figure 2: Wasserstein distance under the ACKD and SCKD settings on three datasets. ACKD consistently incurs higher transport costs than SCKD during training, reflecting the challenge of cross-modal alignment in asymmetric settings.
  • Figure 3: Illustration of the proposed Student-Friendly Matching strategy, consisting of SSM, which performs the initial matching, and DynM, which lets each student sample dynamically select suitable teacher samples during training. DynM involves the current student model.
  • Figure 4: Structure of the Planner, which is used to determine the optimal transport cost.
  • Figure 5: Hyperparameter analysis.
  • ...and 2 more figures
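Figure 2 reports Wasserstein-type transport costs, and the Planner in Figure 4 searches for an optimal transport plan. A common way to compute such a plan is entropy-regularized optimal transport solved with Sinkhorn iterations: setting the gradient of the Lagrangian of min ⟨P, C⟩ − ε·H(P), subject to the marginal constraints, to zero gives a plan of the form P = diag(u) K diag(v) with K = exp(−C/ε), and Sinkhorn alternately rescales u and v until the marginals match. The sketch below is that standard algorithm, swapped in for illustration; the paper's Planner may use a different solver, and the names `sinkhorn`, `eps`, and `iters` are assumptions.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def sinkhorn(
    a: List[float], b: List[float], cost: Matrix, eps: float = 0.1, iters: int = 500
) -> Tuple[Matrix, float]:
    """Entropy-regularized OT between histograms a (length n) and b (length m).

    cost is an n x m matrix. Returns the transport plan P and the
    transport cost <P, C> = sum_ij P_ij * C_ij.
    """
    n, m = len(a), len(b)
    # Gibbs kernel from the Lagrangian stationarity condition: K = exp(-C / eps)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Rescale rows so that row sums of diag(u) K diag(v) match a ...
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # ... then rescale columns so that column sums match b
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    transport_cost = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, transport_cost
```

With uniform weights over two feature sets and `cost[i][j]` set to the squared Euclidean distance between features, `transport_cost` approaches the squared 2-Wasserstein distance as `eps` shrinks. Very small `eps` makes `exp(-C / eps)` underflow, so practical implementations usually run the same updates in the log domain.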