Table of Contents
Fetching ...

Breaking the Geometric Bottleneck: Contrastive Expansion in Asymmetric Cross-Modal Distillation

Kabir Thayani

TL;DR

A critical capacity-density trade-off is revealed: overparameterization within fixed manifolds induces brittleness, while capacity-constrained models act as optimal low-pass semantic filters, successfully recovering inherent noise immunity.

Abstract

Knowledge distillation between asymmetric architectures often induces severe geometric constraints on the learned representation space. In this work, we investigate the Dimensional Collapse phenomenon when distilling global Vision Transformers (CLIP and DINOv2) into capacity-constrained CNNs. By employing strictly centered SVD and Effective Rank, we first demonstrate a capacity-agnostic phase transition on CIFAR-10 where standard cosine distillation collapses representations to an intrinsic Effective Rank of ~16. To reverse this, we integrate an auxiliary contrastive objective (InfoNCE), expanding the student's manifold by 2.4x (to ~38 effective dimensions). We further demonstrate that while DINOv2's uniform geometry partially prevents collapse, contrastive expansion remains a universal requirement to reach the CNN's topological capacity limit (~82 dimensions). Finally, we reveal a critical capacity-density trade-off: overparameterization within fixed manifolds induces brittleness, while capacity-constrained models act as optimal low-pass semantic filters, successfully recovering inherent noise immunity.

Breaking the Geometric Bottleneck: Contrastive Expansion in Asymmetric Cross-Modal Distillation

TL;DR

A critical capacity-density trade-off is revealed: overparameterization within fixed manifolds induces brittleness, while capacity-constrained models act as optimal low-pass semantic filters, successfully recovering inherent noise immunity.

Abstract

Knowledge distillation between asymmetric architectures often induces severe geometric constraints on the learned representation space. In this work, we investigate the Dimensional Collapse phenomenon when distilling global Vision Transformers (CLIP and DINOv2) into capacity-constrained CNNs. By employing strictly centered SVD and Effective Rank, we first demonstrate a capacity-agnostic phase transition on CIFAR-10 where standard cosine distillation collapses representations to an intrinsic Effective Rank of ~16. To reverse this, we integrate an auxiliary contrastive objective (InfoNCE), expanding the student's manifold by 2.4x (to ~38 effective dimensions). We further demonstrate that while DINOv2's uniform geometry partially prevents collapse, contrastive expansion remains a universal requirement to reach the CNN's topological capacity limit (~82 dimensions). Finally, we reveal a critical capacity-density trade-off: overparameterization within fixed manifolds induces brittleness, while capacity-constrained models act as optimal low-pass semantic filters, successfully recovering inherent noise immunity.
Paper Structure (12 sections, 3 equations, 3 figures, 2 tables)

This paper contains 12 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Singular Value Spectrum (Log Scale). The InfoNCE student (green) successfully tracks the Teacher's geometric shape (black).
  • Figure 2: Effective Rank Expansion (DINOv2 Teacher on CIFAR-100).
  • Figure 3: High-Frequency Noise Robustness (CIFAR-100).