Multi-level Supervised Contrastive Learning

Naghmeh Ghanooni, Barbod Pajoum, Harshit Rawal, Sophie Fellenz, Vo Nguyen Le Duy, Marius Kloft

TL;DR

MLCL introduces a unified, end-to-end framework that extends supervised contrastive learning by incorporating multiple projection heads to capture diverse similarity notions across hierarchical levels and multi-label dimensions. Each head learns a distinct representation facet, with head-specific temperatures guiding hard-negative emphasis and level-wise separation, leading to improved sample efficiency and robustness. Empirical results across image and text datasets show MLCL achieves competitive or superior performance to state-of-the-art methods, especially with limited training data, while offering faster convergence and regularization benefits. The approach generalizes existing contrastive learning without architectural commitments, enabling broad applicability to downstream tasks requiring nuanced similarity modeling.

Abstract

Contrastive learning is a well-established paradigm in representation learning. The standard framework of contrastive learning minimizes the distance between "similar" instances and maximizes the distance between dissimilar ones in the projection space, disregarding the various aspects of similarity that can exist between two samples. Current methods rely on a single projection head, which fails to capture the full complexity of different aspects of a sample, leading to suboptimal performance, especially in scenarios with limited training data. In this paper, we present a novel supervised contrastive learning method in a unified framework called multilevel contrastive learning (MLCL) that can be applied to both multi-label and hierarchical classification tasks. The key strength of the proposed method is its ability to capture similarities between samples across different labels and/or hierarchies using multiple projection heads. Extensive experiments on text and image datasets demonstrate that the proposed approach outperforms state-of-the-art contrastive learning methods.
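The idea of several projection heads, each with its own labels and temperature, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes each head is trained with the standard supervised contrastive (SupCon) loss and that the head losses are combined as a weighted sum. The function names `supcon_loss` and `multi_head_loss`, the toy embeddings, the temperatures, and the weights are all hypothetical, and the encoder and projection networks are replaced by precomputed per-head embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature):
    """Standard SupCon loss for one head (assumed form, not taken from the paper).

    For each anchor i with at least one positive (same label, different index),
    average -log softmax(z_i . z_p / temperature) over its positives, where the
    softmax runs over all samples except the anchor itself.
    """
    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without positives contribute nothing
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def multi_head_loss(head_embeddings, head_labels, temperatures, weights):
    """Weighted sum of per-head SupCon losses, one label set and temperature per head."""
    return sum(
        w * supcon_loss(emb, lab, t)
        for emb, lab, t, w in zip(head_embeddings, head_labels, temperatures, weights)
    )


# Toy data (hypothetical): four samples seen through two heads.
fine_head = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
coarse_head = [[1.0, 0.2], [0.8, 0.1], [0.9, 0.3], [-1.0, 0.0]]
fine_labels = [0, 0, 1, 1]      # e.g. class level
coarse_labels = [0, 0, 0, 1]    # e.g. superclass level

loss = multi_head_loss(
    [fine_head, coarse_head],
    [fine_labels, coarse_labels],
    temperatures=[0.1, 0.5],    # assumed values, one per head
    weights=[0.5, 0.5],         # assumed weighting
)
```

In a real model the per-head embeddings would come from separate projection heads on top of a shared encoder, and the additional cross-entropy term used for text classification is omitted here.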

Paper Structure

This paper contains 28 sections, 14 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: An illustration of two projection spaces using different examples: a) TripAdvisor reviews and b) CIFAR-100. Similarity can be defined across various dimensions, and a single projection space is insufficient to capture the full spectrum of feature levels.
  • Figure 2: Our proposed architecture with multiple projection heads for a) hierarchical classification and b) multi-label classification. The final loss is computed as a weighted sum of the losses from each head, with an additional cross-entropy loss applied in the case of text classification.
  • Figure 3: The t-SNE visualization of samples from five different classes in the representation space. The SupCon representation indicates a lack of meaningful structure, as samples from class camel are positioned far from chimpanzee, kangaroo, and cattle compared to the bottle class. In contrast, our proposed method, MLCL, groups animal classes more closely in the representation space.
  • Figure 4: Accuracy of MLCL as a function of the superclass projection head temperature. Increasing the temperature to 0.5 reduces the network's focus on hard negatives, leading to improved accuracy.
  • Figure 5: SupCon positions samples from the cattle class close to pear and sweet pepper. In contrast, MLCL clusters the fruit and vegetable classes more closely in the representation space while separating them from the cattle class.
  • ...and 3 more figures
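
The Figure 4 caption links a higher temperature to reduced focus on hard negatives. A small sketch of why this holds for softmax-based contrastive losses in general: the share of the repulsive gradient that each negative receives, relative to the other negatives, is proportional to exp(similarity / temperature). The similarity values and the helper name `negative_weights` below are hypothetical, and the snippet is not drawn from the paper's experiments.

```python
import math


def negative_weights(similarities, temperature):
    """Softmax over negatives' similarities: each negative's relative share."""
    scaled = [s / temperature for s in similarities]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities between an anchor and four negatives,
# ordered from hardest (most similar) to easiest.
sims = [0.9, 0.5, 0.1, -0.3]

low_temp = negative_weights(sims, 0.1)   # sharply peaked on the hardest negative
high_temp = negative_weights(sims, 0.5)  # weight spread more evenly
```

With the lower temperature nearly all of the weight falls on the hardest negative, while the higher temperature distributes it across the easier negatives as well.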