Beyond I-Con: Exploring New Dimension of Distance Measures in Representation Learning

Jasmine Shone, Zhening Li, Shaden Alshammari, Mark Hamilton, William Freeman

TL;DR

Beyond I-Con challenges the default use of KL divergence in representation learning by systematically exploring $f$-divergences as objective functions. Substituting $D_{\mathrm{KL}}$ with total variation (TV), Jensen-Shannon divergence (JSD), or Hellinger distance improves performance across unsupervised clustering, supervised contrastive learning, and dimensionality reduction. Key findings include state-of-the-art clustering with a TV-based PMI approach on ViT embeddings, improved CIFAR-10 representations with JSD (and Hellinger), and reduced crowding in SNE when using bounded divergences. The work argues that the choice of divergence is a powerful design knob for loss discovery and representation optimization, with implications for training stability and downstream task performance.

Abstract

The Information Contrastive (I-Con) framework revealed that over 23 representation learning methods implicitly minimize KL divergence between data and learned distributions that encode similarities between data points. However, a KL-based loss may be misaligned with the true objective, and properties of KL divergence such as asymmetry and unboundedness may create optimization challenges. We present Beyond I-Con, a framework that enables systematic discovery of novel loss functions by exploring alternative statistical divergences. Key findings: (1) on unsupervised clustering of DINO-ViT embeddings, we achieve state-of-the-art results by modifying the PMI algorithm to use total variation (TV) distance; (2) supervised contrastive learning with Euclidean distance as the feature space metric is improved by replacing the standard loss function with Jensen-Shannon divergence (JSD); (3) on dimensionality reduction, we achieve superior qualitative results and better performance on downstream tasks than SNE by replacing KL with a bounded $f$-divergence. Our results highlight the importance of considering divergence choices in representation learning optimization.
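
To make the contrast between KL and the bounded alternatives concrete, the following is a minimal NumPy sketch of the four divergences on discrete distributions. It is not the authors' implementation: the function names, the smoothing constant `EPS`, and the use of the squared Hellinger distance are illustrative assumptions, and in practice `p` and `q` would be the data and learned neighborhood distributions rather than hand-written vectors.

```python
import numpy as np

EPS = 1e-12  # hypothetical smoothing constant to avoid log(0); not from the paper


def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); unbounded when q_i -> 0 and p_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + EPS) - np.log(q + EPS))))


def tv(p, q):
    """Total variation distance: 0.5 * sum_i |p_i - q_i|; always in [0, 1]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * float(np.sum(np.abs(p - q)))


def jsd(p, q):
    """Jensen-Shannon divergence (natural log); always in [0, log 2]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def hellinger_sq(p, q):
    """Squared Hellinger distance: 0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2; in [0, 1]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))


if __name__ == "__main__":
    # Two distributions with disjoint support: the worst case for KL.
    p = [0.5, 0.5, 0.0]
    q = [0.0, 0.0, 1.0]
    for name, fn in [("KL", kl), ("TV", tv), ("JSD", jsd), ("Hellinger^2", hellinger_sq)]:
        print(f"{name}: {fn(p, q):.4f}")
```

On distributions with disjoint support, TV, JSD, and squared Hellinger reach their maxima (1, $\log 2$, and 1), while the smoothed KL value is limited only by `EPS` and grows without bound as `EPS` shrinks. This is the unboundedness the abstract refers to.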

Paper Structure

This paper contains 11 sections, 2 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Results for running SNE on CIFAR-10 using different divergences, after 150 epochs with a CNN model architecture at learning rate 1e-3. Each color represents a class. KL divergence produces highly overlapping categories in the SNE visualization while other divergences achieve separation.
  • Figure 2: Gradient norms for each divergence from running SNE on CIFAR-10 images with a CNN backbone. KL's unbounded nature creates initialization instability that manifests consistently across all network layers, while bounded divergences (TV, Hellinger, JSD) provide more stable gradient behavior throughout training.