A Generalization Theory of Cross-Modality Distillation with Contrastive Learning

Hangyu Lin, Chen Liu, Chengming Xu, Zhengqi Gao, Yanwei Fu, Yuan Yao

TL;DR

The paper addresses cross-modality distillation under limited labeled data by proposing a general Cross-Modality Contrastive Distillation (CMCD) framework that leverages both positive and negative correspondences via contrastive learning. It introduces two losses, CMD and CMC, within a three-step pipeline: 1) self-supervised contrastive learning on a data-rich source modality; 2) cross-modality distillation that aligns the target modality with the source using CMD or CMC; 3) downstream fine-tuning on the target modality with scarce labels. The theoretical analysis derives convergence and a generalization bound, supported by Rademacher complexities, showing that the target-task error is tied to the total variation distance between the source and target latent distributions. Experiments across image, sketch, depth, video, and audio show 2–3% gains over strong baselines. The work highlights practical impact for memory- and privacy-constrained settings and applies across diverse modalities and tasks.
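To make the role of total variation distance concrete, here is a toy computation. It is not taken from the paper: it assumes the source and target latent distributions have been discretized into histograms over the same bins (the bin values below are made up), and it uses the standard discrete definition $\mathrm{TV}(P, Q) = \tfrac{1}{2}\sum_i |p_i - q_i|$.

```python
def total_variation(p, q):
    """Discrete total variation distance between two histograms.

    Assumption: p and q are probability vectors over the same bins.
    The paper's bound concerns latent distributions in general; this
    is only the finite-bin analogue, for illustration.
    """
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical latent histograms for a source and a target encoder.
source_hist = [0.5, 0.3, 0.2]
well_aligned_target = [0.4, 0.4, 0.2]
poorly_aligned_target = [0.0, 0.1, 0.9]

print(total_variation(source_hist, well_aligned_target))
print(total_variation(source_hist, poorly_aligned_target))
```

Under the summary's reading of the bound, a target encoder whose latent histogram sits closer to the source's (smaller value) would be expected to transfer better to the downstream task than one that sits far away.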

Abstract

Cross-modality distillation arises as an important topic for data modalities containing limited knowledge, such as depth maps and high-quality sketches. Such techniques are of great importance, especially for memory- and privacy-restricted scenarios where labeled training data is generally unavailable. To solve the problem, existing label-free methods leverage a small amount of paired unlabeled data to distill the knowledge by aligning features or statistics between the source and target modalities. For instance, one typically aims to minimize the L2 distance or a contrastive loss between the learned features of paired samples in the source (e.g. image) and target (e.g. sketch) modalities. However, most algorithms in this domain focus only on experimental results and lack theoretical insight. To bridge the gap between the theory and the practice of cross-modality distillation, we first formulate a general framework of cross-modality contrastive distillation (CMCD), built upon contrastive learning that leverages both positive and negative correspondence, towards a better distillation of generalizable features. Furthermore, we establish a thorough convergence analysis revealing that the distance between the source and target modalities significantly impacts the test error on downstream tasks within the target modality, which is also validated by the empirical results. Extensive experimental results show that our algorithm consistently outperforms existing algorithms by a margin of 2-3% across diverse modalities and tasks, covering the modalities of image, sketch, depth map, and audio and the tasks of recognition and segmentation.
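The two alignment objectives mentioned above, an L2 distance and a contrastive loss over paired features, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's CMD or CMC loss: the contrastive term is a generic InfoNCE-style loss, and the function names, the `temperature` value, and the toy feature vectors are all hypothetical.

```python
import math


def _normalize(v):
    """Scale a feature vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_alignment_loss(src_feats, tgt_feats):
    """Mean squared L2 distance between paired source/target features.

    Uses positive pairs only: row i of src_feats is paired with row i
    of tgt_feats.
    """
    total = 0.0
    for s, t in zip(src_feats, tgt_feats):
        total += sum((a - b) ** 2 for a, b in zip(s, t))
    return total / len(src_feats)


def cross_modal_info_nce(src_feats, tgt_feats, temperature=0.1):
    """Generic InfoNCE-style cross-modal loss (an assumed stand-in).

    Target feature i is pulled toward source feature i (positive) and
    pushed away from source features j != i (negatives).
    """
    src = [_normalize(s) for s in src_feats]
    tgt = [_normalize(t) for t in tgt_feats]
    loss = 0.0
    for i, t in enumerate(tgt):
        logits = [_dot(t, s) / temperature for s in src]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        loss += log_denom - logits[i]
    return loss / len(tgt)


# Toy features: two source (e.g. image) and two target (e.g. sketch) samples.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt_aligned = [[0.9, 0.1], [0.1, 0.9]]
tgt_swapped = [[0.1, 0.9], [0.9, 0.1]]

print(l2_alignment_loss(src, tgt_aligned))
print(cross_modal_info_nce(src, tgt_aligned))
print(cross_modal_info_nce(src, tgt_swapped))
```

The L2 loss only sees each positive pair, whereas the contrastive loss also penalizes a target feature for being close to the wrong source sample, which is why the swapped pairing scores much worse than the aligned one.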

Paper Structure

This paper contains 18 sections, 4 theorems, 51 equations, 1 figure, 8 tables, 1 algorithm.

Key Result

Lemma 3.1

Let $\hat{\phi}_{\mathcal{A}}$ be the minimizer of equation eq:step1_1. Then, with probability at least $1 - \delta$, we have, where $\mathcal{P}_{\mathcal{X}_{\mathcal{A}} \times \mathcal{S}}(\Phi_{\mathcal{A}}) = \{\mathbb{P}_{\phi_{\mathcal{A}}}(\bm{x}, s)|\phi_{\mathcal{A}} \in \Phi_{\mathcal{A}}\}$, $\bm{x} = (x_i, x_j)$, $s$ indicates whether $x_i$ and $x_j$ are paired data, and $N_{[]}(\cdot,

Figures (1)

  • Figure 1: Results on image-sketch tasks with different numbers of distillation samples. $m / M$ denotes the percentage of samples used in distillation. We report the top-1 accuracy of downstream classification on the Sketchy and TUBerlin datasets.

Theorems & Definitions (9)

  • Lemma 3.1
  • Theorem 3.2
  • Theorem 3.3
  • proof
  • proof
  • Lemma A.1: Bound of ERM.
  • proof
  • proof
  • proof