Non-target Divergence Hypothesis: Toward Understanding Domain Gaps in Cross-Modal Knowledge Distillation

Yilong Chen; Zongyi Xu; Xiaoshui Huang; Shanshan Zhao; Xinqi Jiang; Xinyu Gao; Xinbo Gao

TL;DR

This work investigates how domain gaps affect cross-modal knowledge distillation (KD). It introduces the Non-target Divergence Hypothesis (NTDH), which posits that the divergence of non-target class distributions governs KD effectiveness across modalities. The authors use a VC-theory-based analysis to derive upper and lower bounds on the cross-modal KD error, validate NTDH with experiments on five multimodal datasets, and propose a practical masking framework that reduces non-target divergence. They show that aligning non-target distributions improves cross-modal KD performance and that the masking strategy generalizes to existing KD methods. Overall, NTDH offers a principled explanation for domain-gap effects in cross-modal KD and practical guidance for multimodal knowledge transfer.

Abstract

Compared to single-modal knowledge distillation, cross-modal knowledge distillation faces more severe challenges due to domain gaps between modalities. Although many methods have been proposed to overcome these challenges, there is still limited research on how domain gaps affect cross-modal knowledge distillation. This paper provides an in-depth analysis and evaluation of this issue. We first introduce the Non-Target Divergence Hypothesis (NTDH) to reveal the impact of domain gaps on cross-modal knowledge distillation. Our key finding is that domain gaps between modalities lead to distribution differences in non-target classes, and that the smaller these differences are, the better cross-modal knowledge distillation performs. Subsequently, based on Vapnik-Chervonenkis (VC) theory, we derive upper and lower bounds on the approximation error of cross-modal knowledge distillation, thereby theoretically validating the NTDH. Finally, experiments on five cross-modal datasets further confirm the validity, generalisability, and applicability of the NTDH.
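To make the notion of a "distribution difference in non-target classes" concrete, the following is a minimal sketch of one way such a quantity could be measured for a single sample. It is an illustration under stated assumptions, not the authors' metric: the function names, the choice of KL divergence, and the renormalisation of the non-target probabilities are all assumptions introduced here.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def non_target_distribution(probs, target):
    """Drop the target class and renormalise the remaining probabilities."""
    rest = [p for i, p in enumerate(probs) if i != target]
    total = sum(rest)
    return [p / total for p in rest]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps guards against log(0). Assumed divergence measure."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def non_target_divergence(teacher_logits, student_logits, target, temperature=1.0):
    """Hypothetical helper: divergence between teacher and student
    predictions restricted to the non-target classes."""
    t = non_target_distribution(softmax(teacher_logits, temperature), target)
    s = non_target_distribution(softmax(student_logits, temperature), target)
    return kl_divergence(t, s)

if __name__ == "__main__":
    # Same ranking of non-target classes: divergence is close to zero.
    print(non_target_divergence([5.0, 1.0, 2.0], [0.0, 1.0, 2.0], target=0))
    # Swapped non-target classes: divergence is clearly positive.
    print(non_target_divergence([5.0, 1.0, 2.0], [5.0, 2.0, 1.0], target=0))
```

In this sketch, changing only the target-class logit leaves the measure unchanged, because the target class is removed before renormalisation. That isolates the non-target part of the prediction, which is the part the hypothesis is concerned with.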

Paper Structure (30 sections, 27 equations, 13 figures, 4 tables, 2 algorithms)

Introduction
Related Work
Unimodal KD
Crossmodal KD
Domain Gaps in Cross-modal KD
The Proposed Hypothesis
Symbol Definitions And Conditional Assumptions
Non-target Divergence Hypothesis
Prove the Non-target Divergence Hypothesis
Experiments
Datasets
Scikit-learn
MNIST/MNIST-M
RAVDESS
SemanticKITTI
...and 15 more sections

Figures (13)

Figure 1: Examples of cross-modal data. (a) A modality-misaligned scene. (b)-(d) Modality-aligned scenarios (the subject of our research), such as images and audio of a guitar, RGB images from the same camera perspective and point clouds projected onto images and depth maps.
Figure 2: Venn diagram example for NTDH. The prediction distributions of the teacher and student models consist of target class and non-target class distributions. For example, in the prediction of road areas, the overlapping region in the Venn diagram represents the target class prediction distribution, while the non-overlapping region represents the non-target class prediction distribution.
Figure 3: An illustration of NTDH with synthetic Scikit-learn data. As the domain discrepancy of multimodal data increases ($\gamma$ ranges from 0 to 1), the performance of KD gradually declines. At the same time, the discrepancy in non-target class prediction distributions increases, and its growth rate far exceeds that of the target class distribution discrepancy.
Figure 4: Data Processing. (a) Scikit-learn Data: With the student modality ${{x}^{b}}$ fixed, the teacher modality ${{x}^{a}}$ is adjusted by changing the parameter $\gamma$, which ranges from 0 to 1. A larger $\gamma$ indicates a greater domain difference between the modalities. (b) MNIST/MNIST-M: Random noise is added to the MNIST and MNIST-M datasets to test the robustness of the model. (c) SemanticKITTI: 3D point cloud data is projected into the camera coordinate system to produce 2D point clouds, and dense image segmentation labels are obtained as described in chen2024foundation.
Figure 5: The experimental plan for NTDH. The first part involves adjusting weight coefficients in the loss function to assess the impact of distribution differences among non-target classes on KD. The second part applies a masking method to selectively remove features or samples with significant non-target distribution differences, evaluating the subsequent effect on distillation performance.
...and 8 more figures
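The second part of the experimental plan in Figure 5, masking out samples with large non-target distribution differences, can be illustrated with a rough sketch. The thresholding rule, the KL-based score, and every function name below are assumptions made for illustration; the captions above do not specify how the paper's masking method is actually implemented.

```python
import math

def _softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def _non_target(probs, target):
    # Remove the target class and renormalise what is left.
    rest = [p for i, p in enumerate(probs) if i != target]
    total = sum(rest)
    return [p / total for p in rest]

def _kl(p, q, eps=1e-12):
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))

def sample_mask(teacher_logits, student_logits, targets, threshold):
    """Keep a sample only if its non-target divergence is at most
    `threshold` (a hypothetical hyperparameter)."""
    mask = []
    for t_log, s_log, y in zip(teacher_logits, student_logits, targets):
        ntd = _kl(_non_target(_softmax(t_log), y),
                  _non_target(_softmax(s_log), y))
        mask.append(ntd <= threshold)
    return mask

def masked_kd_loss(teacher_logits, student_logits, mask):
    """Mean KL(teacher || student) over the kept samples only."""
    kept = [_kl(_softmax(t), _softmax(s))
            for t, s, keep in zip(teacher_logits, student_logits, mask)
            if keep]
    return sum(kept) / len(kept) if kept else 0.0

if __name__ == "__main__":
    teacher = [[5.0, 1.0, 2.0], [5.0, 1.0, 2.0]]
    student = [[4.0, 1.0, 2.0], [5.0, 2.0, 1.0]]
    mask = sample_mask(teacher, student, targets=[0, 0], threshold=0.1)
    print(mask)
    print(masked_kd_loss(teacher, student, mask))
```

Here the second sample is dropped because the student ranks the non-target classes in the opposite order to the teacher, so only the first sample contributes to the distillation loss. A per-feature mask, which the caption also mentions, would follow the same pattern at a finer granularity.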
