Cross-Modal Distillation For Widely Differing Modalities

Cairong Zhao, Yufeng Jin, Zifan Song, Haonan Chen, Duoqian Miao, Guosheng Hu

TL;DR

This work tackles the challenge of transferring knowledge across widely differing modalities to improve uni-modal deployment when multi-modal data is not available at inference. It introduces a trainable projection head on the teacher, along with two soft constraint distillation strategies (feature-level and classifier-level) and a quality-aware weighting module to prevent overfitting and adapt to data quality. The approach is validated on speaker recognition and image classification, demonstrating notable gains over baselines and showing cross-modal matching capabilities, even when the teacher is not strictly superior. Key findings include the superiority of soft constraints over hard alignment for cross-modal transfer, the usefulness of a projection head to bridge modality gaps, and the robustness conferred by sample-quality weighting across datasets and tasks.

Abstract

Deep learning has made great progress recently; however, it is neither easy nor efficient to further improve performance by increasing model size. Multi-modal learning can mitigate this challenge by introducing richer and more discriminative information as input. Because multi-modal data is often unavailable at deployment time, we conduct multi-modal learning by introducing a teacher model that transfers discriminative knowledge to a student model during training. However, this knowledge transfer via distillation is not trivial, because the large domain gap between widely differing modalities can easily lead to overfitting. In this work, we introduce a cross-modal distillation framework. Specifically, we find that a hard constrained loss, e.g. an l2 loss that forces the student to be exactly the same as the teacher, can easily lead to overfitting in cross-modal distillation. To address this, we propose two soft constrained knowledge distillation strategies, at the feature level and the classifier level respectively. In addition, we propose a quality-based adaptive weights module that weights input samples by quantified data quality, leading to robust model training. We conducted experiments on speaker recognition and image classification tasks, and the results show that our approach effectively transfers knowledge between the commonly used and widely differing modalities of image, text, and speech.
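
The contrast the abstract draws between hard and soft constraints can be illustrated with a small sketch. This is not the paper's formulation: the margin-based relaxation, the `margin` value, the function names, and the quality scores below are all assumptions chosen only to show the idea that a soft constraint tolerates small student-teacher differences, and that per-sample losses can be re-weighted by a quality score.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hard_l2_loss(student, teacher):
    """Hard constraint: any deviation from the teacher feature is penalized."""
    return sum(l2_distance(s, t) ** 2 for s, t in zip(student, teacher)) / len(student)

def soft_margin_loss(student, teacher, margin=0.5):
    """Illustrative soft constraint (assumed, not the paper's loss):
    only the part of the distance that exceeds `margin` is penalized."""
    return sum(
        max(l2_distance(s, t) - margin, 0.0) ** 2 for s, t in zip(student, teacher)
    ) / len(student)

def quality_weighted_loss(per_sample_losses, quality_scores):
    """Weight each sample's loss by its normalized (hypothetical) quality score."""
    total = sum(quality_scores)
    return sum(q / total * l for q, l in zip(quality_scores, per_sample_losses))

# Toy batch: the first student feature is close to its teacher, the second is far.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.9, 0.1], [0.0, 2.0]]

hard = hard_l2_loss(student, teacher)
soft = soft_margin_loss(student, teacher)
per_sample = [l2_distance(s, t) ** 2 for s, t in zip(student, teacher)]
weighted = quality_weighted_loss(per_sample, quality_scores=[0.9, 0.1])

print(hard, soft, weighted)
```

In this toy batch the soft loss ignores the first sample entirely because its distance falls inside the margin, while the hard loss still pulls the student toward the teacher. The quality weighting reduces the contribution of the second, low-quality sample.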

Paper Structure

This paper contains 22 sections, 13 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: The effect of different model sizes and different modalities on the performance of identity recognition tasks. ResNet34 with different numbers (1/4, 1/2 and full) of convolutional channels are trained on the VoxCeleb2 dataset using audio and image as inputs, respectively.
  • Figure 2: Illustration of cross-modal knowledge transfer. The strong modality (e.g. image) transfers discriminative knowledge to the weak modality (e.g. speech) during training. At test time, only the weak modality is used.
  • Figure 3: Overview of our cross-modal distillation framework, in which the teacher modality transfers discriminative knowledge to the student modality. The projection head extracts modality-shared features from the teacher model to reduce the modality gap. Soft constrained knowledge distillation imposes distillation constraints at the feature level and the classifier level, which avoids overfitting while transferring knowledge. The quality module weights the training samples based on quantified data quality, leading to robust training.
  • Figure 4: Knowledge transfer can lead to overfitting if modality-specific information is transferred.
  • Figure 5: Illustration of classifier-level distillation. Different shapes indicate different categories, and dashed lines indicate decision boundaries.
  • ...and 4 more figures