Learning Label Hierarchy with Supervised Contrastive Learning

Ruixue Lian, William A. Sethares, Junjie Hu

TL;DR

Standard supervised contrastive learning (SCL) treats all classes as equally related. This work addresses that limitation with LASCL, a framework that leverages label hierarchies to shape the embedding space. LASCL constructs learnable label representations from hierarchical label descriptions, computes class similarities from them, and uses those similarities to scale the SCL objective. It also adds an instance-centering term, in both unweighted and weighted variants. The authors propose four LASCL variants (LI, LIUC, LIC, LISC) and show on three text classification datasets that incorporating hierarchical information improves intra-cluster compactness and inter-cluster separation. The learned label centers can be used directly as a nearest-neighbor classifier. The approach integrates easily with existing encoders (e.g., BERT) and performs well in both few-shot and full-data regimes, which highlights the practical value of exploiting label taxonomy for more discriminative representations.
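The class-similarity step mentioned above can be illustrated with a minimal sketch. It assumes cosine similarity between label embeddings, which this summary does not specify; the function names and the toy vectors are hypothetical and stand in for encoder outputs of the label descriptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def class_similarity(label_embeddings):
    """Pairwise similarity matrix with one row and one column per class."""
    return [[cosine(u, v) for v in label_embeddings] for u in label_embeddings]

# Toy label embeddings (hypothetical values, not taken from the paper).
label_embeddings = [
    [1.0, 0.1, 0.0],  # a fine-grained class
    [0.9, 0.2, 0.0],  # a sibling class under the same parent
    [0.0, 0.1, 1.0],  # a class from a different branch
]
sim = class_similarity(label_embeddings)
```

In this toy setup the sibling pair scores higher than the cross-branch pair, which is the kind of signal a label-aware objective could use to scale how strongly two classes are pushed apart.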

Abstract

Supervised contrastive learning (SCL) frameworks treat each class as independent and thus consider all classes to be equally important. This neglects the common scenario in which a label hierarchy exists, where fine-grained classes under the same category are more similar to each other than to classes from very different categories. This paper introduces a family of Label-Aware SCL methods (LASCL) that incorporates hierarchical information into SCL by leveraging similarities between classes, yielding a better-structured and more discriminative feature space. This is achieved by first adjusting the distance between instances according to the proximity of their classes, using a scaled instance-instance-wise contrastive term. An additional instance-center-wise contrastive term moves within-class examples closer to their centers, which are represented by a set of learnable label parameters. The learned label parameters can be used directly as a nearest-neighbor classifier without further fine-tuning. In this way, a better feature representation is generated, with improved intra-cluster compactness and inter-cluster separation. Experiments on three datasets show that the proposed LASCL works well on text classification tasks that require distinguishing a single label among multiple labels, outperforming the baseline supervised approaches. Our code is publicly available.
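As a rough illustration of using label centers as a nearest-neighbor classifier, the sketch below assigns each instance embedding to the closest center. The choice of cosine similarity, the `predict` helper, and the toy vectors are assumptions made for this example, not details confirmed by the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict(embedding, label_centers):
    """Return the index of the label center most similar to the embedding."""
    scores = [cosine(embedding, center) for center in label_centers]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical learned label centers, one per class.
label_centers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Hypothetical instance embeddings from the same encoder.
instances = [[0.9, 0.2], [0.1, 0.8], [-0.7, 0.1]]

predictions = [predict(x, label_centers) for x in instances]
```

No classification head is trained here: prediction only compares each embedding with the stored centers, which is what allows the label parameters to be used without further fine-tuning.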

Paper Structure (33 sections, 7 equations, 5 figures, 6 tables)

This paper contains 33 sections, 7 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Supervised vs. label-aware supervised contrastive loss. The supervised contrastive loss (left) contrasts the set of all samples from the same class as positives against the negatives from the remainder of the batch (Khosla et al., 2020). The label-aware supervised contrastive loss (right) proposed in our work incorporates label hierarchy by considering class similarities.
  • Figure 2: (a) The label hierarchy of the 20News dataset. The root node contains 7 classes, and each branch has multiple fine-grained sub-categories. (b) t-SNE visualization of hierarchical label embeddings encoded by BERT-base.
  • Figure 3: Direct testing (DT) of k-shot prediction performance (measured by NodeAcc) on three datasets.
  • Figure 4: t-SNE visualization on the 20News dataset (keeping the original distribution) with (a) BERT-base, (b) SCL, (c) LISC. Label representations are marked by appropriately colored "$\times$".
  • Figure 5: Sensitivity to different hierarchies on 20News: (a) NodeAcc with different bottom-up label hierarchies ranging from 1 to 5; (b) NodeAcc on labels grouped by different hierarchies.