DisCoM-KD: Cross-Modal Knowledge Distillation via Disentanglement Representation and Adversarial Learning

Dino Ienco, Cassio Fraga Dantas

TL;DR

DisCoM-KD addresses cross-modal knowledge distillation under modality mismatch with a disentangled, adversarial framework that extracts three per-modality representations (modality-invariant, modality-informative, and modality-irrelevant) and trains all single-modal classifiers simultaneously. By combining disentanglement with gradient-reversal-based domain adaptation, it removes the need for a teacher model and for separately trained student models, and it outperforms competing methods on three multi-modal benchmarks covering both overlapping and non-overlapping modalities. The training objective combines five loss terms (L_cl, L_adv, L_mod, L_aux, L_perp) to enforce task relevance, invariance, modality-awareness, and complementary information, and ablations underscore the importance of the auxiliary and disentanglement losses. The findings suggest moving away from the traditional teacher–student paradigm for CMKD toward integrated, multi-representation distillation that is robust across modality configurations and scalable to practical multi-modal learning scenarios.
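
The gradient-reversal mechanism mentioned above can be illustrated with a minimal, framework-free sketch. It assumes the standard gradient reversal layer from adversarial domain adaptation (identity in the forward pass, negated and scaled gradient in the backward pass). The class name `GradientReversal`, the strength `lam`, and the toy vectors are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GradientReversal:
    """Identity on the forward pass; flips and scales gradients on the backward pass."""

    lam: float = 1.0  # hypothetical reversal strength, not a value from the paper

    def forward(self, z: List[float]) -> List[float]:
        # Features reach the modality discriminator unchanged.
        return list(z)

    def backward(self, grad_out: List[float]) -> List[float]:
        # The discriminator's gradient is negated before it reaches the encoder,
        # so the encoder is pushed to make modalities harder to tell apart.
        return [-self.lam * g for g in grad_out]


grl = GradientReversal(lam=0.5)
z_inv = [0.2, -1.0, 3.0]                    # toy modality-invariant embedding
grad_from_discriminator = [1.0, -2.0, 0.5]  # toy upstream gradient

print(grl.forward(z_inv))                     # [0.2, -1.0, 3.0]
print(grl.backward(grad_from_discriminator))  # [-0.5, 1.0, -0.25]
```

In an autodiff framework the same behaviour would normally be written as a custom differentiable operation. The manual `forward`/`backward` pair here only makes the sign flip visible.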

Abstract

Cross-modal knowledge distillation (CMKD) refers to the scenario in which a learning framework must handle training and test data that exhibit a modality mismatch; more precisely, training and test data do not cover the same set of data modalities. Traditional approaches to CMKD are based on a teacher/student paradigm in which a teacher is trained on multi-modal data and its knowledge is subsequently distilled into a single-modal student. Despite the widespread adoption of this paradigm, recent research has highlighted its inherent limitations in the context of cross-modal knowledge transfer. Taking a step beyond the teacher/student paradigm, here we introduce a new framework for cross-modal knowledge distillation, named DisCoM-KD (Disentanglement-learning based Cross-Modal Knowledge Distillation), that explicitly models different types of per-modality information with the aim of transferring knowledge from multi-modal data to a single-modal classifier. To this end, DisCoM-KD combines disentanglement representation learning with adversarial domain adaptation to simultaneously extract, for each modality, domain-invariant, domain-informative and domain-irrelevant features according to a specific downstream task. Unlike the traditional teacher/student paradigm, our framework simultaneously learns all single-modal classifiers, eliminating the need to learn each student model separately as well as the teacher classifier. We evaluated DisCoM-KD on three standard multi-modal benchmarks and compared its behaviour with recent SOTA knowledge distillation frameworks. The findings clearly demonstrate the effectiveness of DisCoM-KD over competitors in mismatch scenarios involving both overlapping and non-overlapping modalities. These results offer insights to reconsider the traditional paradigm for distilling information from multi-modal data to single-modal neural networks.
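
As a rough illustration of how the three per-modality feature types might be kept apart, the sketch below penalises the squared cosine similarity between each pair of embeddings. This is only an assumption about what an orthogonality-style term such as the L_perp listed in the TL;DR could look like. It is not the paper's definition, and the function names and toy vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def orthogonality_penalty(
    z_inv: Sequence[float], z_inf: Sequence[float], z_irr: Sequence[float]
) -> float:
    """Mean squared cosine similarity over the three embedding pairs.

    Assumption: all three pairs are penalised equally. The value is 0 when
    the embeddings are mutually orthogonal and grows as they align.
    """
    pairs = [(z_inv, z_inf), (z_inv, z_irr), (z_inf, z_irr)]
    return sum(cosine_similarity(a, b) ** 2 for a, b in pairs) / len(pairs)


# Mutually orthogonal embeddings incur no penalty.
print(orthogonality_penalty([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]))  # 0.0

# Identical embeddings incur a penalty close to 1, the maximum.
shared = [1.0, 2.0, 3.0]
print(orthogonality_penalty(shared, shared, shared))
```

Squaring the cosine treats aligned and anti-aligned embeddings alike, which suits a decorrelation objective. Other choices, such as the absolute value, would serve the same illustrative purpose.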

Paper Structure

This paper contains 8 sections, 7 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: Schematic overview of DisCoM-KD: On the left, there are two per-modality branch extractors for modalities $M1$ and $M2$, along with two per-modality task classifiers to obtain the final prediction. On the right, several auxiliary classifiers, acting on intermediate representations, help disentangle per-modality information and make the representations task-informative. The training of the two parallel architectures is performed jointly, but at inference time, each model is deployed independently.
  • Figure 2: Details of the Modality Branch Extractor. It consists of two encoders, one extracting modality-specific ($z_{*}^{irr}$, $z_{*}^{inf}$) and one deriving modality-invariant ($z_{*}^{inv}$) representations. A projection head is used on the output of the modality-invariant encoder to obtain embeddings of the same size as the other representations.
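
The data flow described for Figure 2 can be mirrored in a small shape-level sketch. It makes several assumptions that the captions do not confirm: random linear maps stand in for trained encoders, the modality-specific encoder emits a single vector that is split in half into $z^{irr}$ and $z^{inf}$, and every class name and dimension is hypothetical.

```python
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Stand-in for learned weights: a rows x cols matrix of random values."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product, playing the role of an encoder layer."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ModalityBranchExtractor:
    """Toy branch extractor for one modality (hypothetical layout)."""

    def __init__(self, in_dim: int, emb_dim: int, inv_hidden_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.emb_dim = emb_dim
        # Assumption: the modality-specific encoder emits 2 * emb_dim values,
        # later split into the irrelevant and informative parts.
        self.w_specific = random_matrix(2 * emb_dim, in_dim, rng)
        # Modality-invariant encoder with its own output width.
        self.w_invariant = random_matrix(inv_hidden_dim, in_dim, rng)
        # Projection head mapping the invariant output down to emb_dim.
        self.w_projection = random_matrix(emb_dim, inv_hidden_dim, rng)

    def __call__(self, x: Vector) -> Dict[str, Vector]:
        specific = linear(self.w_specific, x)
        hidden = linear(self.w_invariant, x)
        return {
            "irr": specific[: self.emb_dim],
            "inf": specific[self.emb_dim :],
            "inv": linear(self.w_projection, hidden),
        }


extractor = ModalityBranchExtractor(in_dim=6, emb_dim=4, inv_hidden_dim=8)
reps = extractor([0.1, -0.3, 0.5, 0.0, 1.2, -0.7])
for name, vec in reps.items():
    print(name, len(vec))  # each representation has length emb_dim (4)
```

One such extractor would be instantiated per modality. Because each branch only consumes its own input, a single branch and its task classifier can be used alone at inference time, in line with the Figure 1 caption.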