A Generalization Theory of Cross-Modality Distillation with Contrastive Learning
Hangyu Lin, Chen Liu, Chengming Xu, Zhengqi Gao, Yanwei Fu, Yuan Yao
TL;DR
The paper addresses cross-modality distillation under limited labeled data by proposing a general Cross-Modality Contrastive Distillation (CMCD) framework that leverages both positive and negative correspondences via contrastive learning. It introduces two losses, CMD and CMC, within a three-step pipeline: 1) self-supervised contrastive learning on a rich source modality, 2) cross-modality distillation to align a target modality using CMD/CMC, 3) downstream fine-tuning on the target with scarce labels. The theoretical analysis derives a convergence result and a generalization bound, stated in terms of Rademacher complexities, showing that the target task error is tied to the total variation distance between the source and target latent distributions. Experiments across image, sketch, depth, video, and audio demonstrate 2–3% gains over strong baselines. The work highlights practical impact for memory- and privacy-constrained settings and demonstrates broad applicability across diverse modalities and tasks.
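To make the quantity in the bound concrete, the following is a minimal sketch of the total variation distance between two discrete distributions. It only illustrates the distance itself, not the paper's bound or its estimator; the function name `total_variation` and the toy histograms are assumptions introduced here for illustration.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    p and q are sequences of probabilities over the same finite support.
    TV(p, q) = 0.5 * sum_i |p_i - q_i|, which lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same support size")
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical toy histograms of latent codes (illustrative values only).
source_latent = [0.5, 0.3, 0.2]
target_aligned = [0.5, 0.3, 0.2]  # target latents match the source
target_shifted = [0.2, 0.3, 0.5]  # target latents are poorly aligned

tv_aligned = total_variation(source_latent, target_aligned)
tv_shifted = total_variation(source_latent, target_shifted)
```

Under the reading given above, a smaller distance between the source and target latent distributions (as in `tv_aligned`) corresponds to a more favorable bound on the target task error than a larger one (as in `tv_shifted`).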
Abstract
Cross-modality distillation arises as an important topic for data modalities containing limited knowledge, such as depth maps and high-quality sketches. Such techniques are of great importance, especially in memory- and privacy-restricted scenarios where labeled training data is generally unavailable. To solve the problem, existing label-free methods leverage a small amount of paired unlabeled data to distill knowledge by aligning features or statistics between the source and target modalities. For instance, one typically aims to minimize the L2 distance or a contrastive loss between the learned features of paired samples in the source (e.g. image) and target (e.g. sketch) modalities. However, most algorithms in this domain focus only on experimental results and lack theoretical insight. To bridge the gap between theory and practice in cross-modality distillation, we first formulate a general framework of cross-modality contrastive distillation (CMCD), built upon contrastive learning that leverages both positive and negative correspondence, towards a better distillation of generalizable features. Furthermore, we establish a thorough convergence analysis revealing that the distance between the source and target modalities significantly impacts the test error on downstream tasks within the target modality, which is also validated by the empirical results. Extensive experimental results show that our algorithm consistently outperforms existing algorithms by a margin of 2-3% across diverse modalities and tasks, covering image, sketch, depth map, and audio modalities and recognition and segmentation tasks.
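The two alignment objectives mentioned above (L2 distance and a contrastive loss over paired features) can be sketched as follows. This is a generic InfoNCE-style illustration, not the paper's CMD or CMC loss, whose exact forms are not given in this section; the function names, the temperature value, and the toy features are all assumptions.

```python
import math


def l2_alignment_loss(src, tgt):
    """Mean squared L2 distance between paired feature vectors."""
    total = 0.0
    for s, t in zip(src, tgt):
        total += sum((a - b) ** 2 for a, b in zip(s, t))
    return total / len(src)


def _normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def info_nce_loss(src, tgt, temperature=0.1):
    """Generic InfoNCE-style loss over a batch of paired features.

    (src[i], tgt[i]) is the positive pair; every tgt[j] with j != i
    serves as a negative for src[i].
    """
    src = [_normalize(v) for v in src]
    tgt = [_normalize(v) for v in tgt]
    loss = 0.0
    for i, s in enumerate(src):
        # Cosine similarity of src[i] to every target, scaled by temperature.
        logits = [sum(a * b for a, b in zip(s, t)) / temperature for t in tgt]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_denominator - logits[i]
    return loss / len(src)


# Hypothetical paired features: row i of each list comes from the same sample.
source_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # targets close to their paired sources
swapped = [[0.1, 0.9], [0.9, 0.1]]  # targets close to the wrong sources

loss_aligned = info_nce_loss(source_feats, aligned)
loss_swapped = info_nce_loss(source_feats, swapped)
```

The L2 objective uses only the positive pairs, whereas the contrastive objective also pushes each source feature away from the targets of other samples, which is the positive-and-negative correspondence the framework builds on.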
