HILL: Hierarchy-aware Information Lossless Contrastive Learning for Hierarchical Text Classification

He Zhu, Junran Wu, Ruomei Liu, Yue Hou, Ze Yuan, Shangzhe Li, Yicheng Pan, Ke Xu

TL;DR

The paper tackles hierarchical text classification (HTC) with self-supervised contrastive learning, observing that input augmentation can distort semantic content. It introduces HILL, a framework in which a BERT-based text encoder and a structure encoder collaborate: the structure encoder builds a coding tree of the label hierarchy via structural entropy minimization and produces an information-rich positive view $h_T$, which is fused with the text view $h_D$ through a contrastive objective. The paper proves a formal information lossless learning principle, showing that the mutual information preserved by HILL upper-bounds that of augmentation-based methods. Empirically, HILL achieves state-of-the-art results on three HTC datasets (WOS, RCV1-v2, NYTimes), improving over the baselines while training efficiently thanks to a compact structure-encoder design. The work provides a principled, scalable path to incorporating label structure into representation learning for HTC, with practical implications for hierarchical NLP tasks.
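To make the fusion of the two views concrete, the following is a minimal sketch of a contrastive objective between a batch of text views $h_D$ and structure views $h_T$. It assumes a simplified one-directional InfoNCE loss with cosine similarity; the function names, the `temperature` value, and the choice of treating the other documents in the batch as negatives are illustrative assumptions, not the loss as specified in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(h_d, h_t, temperature=0.1):
    """One-directional InfoNCE loss (assumed form, not the paper's exact loss).

    h_d[i] is the text view of document i and h_t[i] its structure view.
    The pair (h_d[i], h_t[i]) is the positive; h_t[j] for j != i are negatives.
    """
    n = len(h_d)
    total = 0.0
    for i in range(n):
        logits = [cosine(h_d[i], h_t[j]) / temperature for j in range(n)]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy batch: two documents with 2-dimensional views.
text_views = [[1.0, 0.0], [0.0, 1.0]]
aligned_structure_views = [[1.0, 0.0], [0.0, 1.0]]
swapped_structure_views = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce(text_views, aligned_structure_views)
swapped_loss = info_nce(text_views, swapped_structure_views)
```

In this toy batch the loss is small when each structure view matches its own text view and large when the views are swapped, which is the behavior a contrastive objective relies on.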

Abstract

Existing self-supervised methods in natural language processing (NLP), especially for hierarchical text classification (HTC), mainly focus on self-supervised contrastive learning and rely heavily on human-designed augmentation rules to generate contrastive samples, which can corrupt or distort the original information. In this paper, we investigate the feasibility of a contrastive learning scheme in which the semantic and syntactic information inherent in the input sample is adequately preserved in the contrastive samples and fused during the learning process. Specifically, we propose an information lossless contrastive learning strategy for HTC, namely Hierarchy-aware Information Lossless contrastive Learning (HILL), which consists of a text encoder that represents the input document and a structure encoder that directly generates the positive sample. The structure encoder takes the document embedding as input, extracts the essential syntactic information inherent in the label hierarchy using the principle of structural entropy minimization, and injects this syntactic information into the text representation via hierarchical representation learning. Experiments on three common datasets verify the superiority of HILL.
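The hierarchical representation learning step can be pictured as a bottom-up pass over the coding tree: leaf embeddings come from the document embedding, and each non-leaf node is computed from its children. The sketch below assumes sum pooling over children and a hypothetical `transform` callable standing in for a learned per-layer update; the tree encoding, the names, and the pooling choice are illustrative assumptions rather than the paper's architecture.

```python
def bottom_up_embeddings(children, root, leaf_embeddings, transform=lambda v: v):
    """Compute an embedding for every coding-tree node, leaves first.

    children:        dict mapping each non-leaf node to a list of its children.
    root:            identifier of the root node.
    leaf_embeddings: dict mapping each leaf node to a list of floats
                     (initialized from the document embedding).
    transform:       hypothetical stand-in for a learned per-layer update;
                     the default is the identity.
    """
    embeddings = dict(leaf_embeddings)

    def visit(node):
        if node in embeddings:
            return embeddings[node]
        child_vectors = [visit(child) for child in children[node]]
        # Sum pooling: add the child vectors coordinate by coordinate.
        pooled = [sum(values) for values in zip(*child_vectors)]
        embeddings[node] = transform(pooled)
        return embeddings[node]

    visit(root)
    return embeddings


# Toy coding tree of height 2 with three leaves.
tree = {"root": ["a", "b"], "a": ["l1", "l2"], "b": ["l3"]}
leaves = {"l1": [1.0, 0.0], "l2": [0.0, 1.0], "l3": [2.0, 2.0]}
node_embeddings = bottom_up_embeddings(tree, "root", leaves)
```

With the identity `transform`, the root embedding is simply the sum of all leaf embeddings; a trained model would replace the identity with a parameterized update at each level.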

Paper Structure

This paper contains 35 sections, 1 theorem, 20 equations, 10 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

Given a document $D$ and the coding tree $T_L$ of the label hierarchy $G_L$, denote the corresponding random variables by $\mathcal{D}$, $\mathcal{T_L}$, and $\mathcal{G_L}$. For any augmentation function $\theta$, we have,

Figures (10)

  • Figure 1: Comparison between HILL and previous methods. (a) Previous work uses the structure encoder for data augmentation. (b) Our method extracts syntactic information under an information lossless learning paradigm.
  • Figure 2: An example of our model with $K=3$. We first feed the document $D$ into the text encoder to extract the semantic information. Next, the structure encoder takes the label hierarchy $G_L$ as input and constructs the optimal coding tree $T_L$ with Algorithm \ref{alg:1} under the guidance of structural entropy. In the hierarchical representation learning module, the leaf node embeddings are initialized by the document embeddings, and the representations of non-leaf nodes are learned from bottom to top. The structure encoder finally generates an information lossless positive view for the text encoder, which is formulated in Section \ref{sec:proof} and proved in Appendix \ref{apdx:proof}.
  • Figure 3: Test performance of HILL with different heights $K$ of the coding tree on three datasets.
  • Figure 4: The number of trainable parameters (M) and the average training time (s) of our model and HGCLR on WOS, RCV1-v2, and NYTimes.
  • Figure 5: An illustration of coding trees and structural entropy. The coding tree $T$ provides us with multi-granularity partitions of the original graph $G$, as shown by the three partitions in the example. Structural entropy is defined as the average amount of information of a random walk between two nodes in $V_G$, considering all nodes partitioned (encoded and decoded) by coding tree $T$. Under the guidance of structural entropy, coding tree $T$ could reveal the essential structure of graph $G$.
  • ...and 5 more figures
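
The structural entropy described in the Figure 5 caption can be evaluated directly for a small graph. The sketch below assumes the standard definition from structural information theory, $H^T(G) = -\sum_{\alpha \in T, \alpha \neq \lambda} \frac{g_\alpha}{\mathrm{vol}(G)} \log_2 \frac{\mathrm{vol}(\alpha)}{\mathrm{vol}(\alpha^-)}$, where $g_\alpha$ counts the edges leaving the vertex set of node $\alpha$ and $\alpha^-$ is its parent; this formula is not stated in the text above. It is restricted to a coding tree of height 2 (root, modules, vertices), assumes an undirected, unweighted graph with no isolated vertices, and only evaluates the entropy of a given tree; it does not perform the minimization used to build $T_L$. All function names are hypothetical.

```python
import math
from collections import defaultdict


def vertex_degrees(edges):
    """Degree of every vertex in an undirected, unweighted edge list."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def one_dimensional_entropy(edges):
    """Structural entropy without any partition (every vertex under the root)."""
    degree = vertex_degrees(edges)
    volume = sum(degree.values())
    return -sum(d / volume * math.log2(d / volume) for d in degree.values())


def two_level_structural_entropy(edges, modules):
    """Structural entropy under a height-2 coding tree: root -> modules -> vertices.

    modules is a list of vertex lists that together partition the vertex set.
    """
    degree = vertex_degrees(edges)
    total_volume = sum(degree.values())
    module_of = {v: i for i, module in enumerate(modules) for v in module}

    entropy = 0.0
    for i, module in enumerate(modules):
        module_volume = sum(degree[v] for v in module)
        # Edges with exactly one endpoint inside this module.
        cut = sum(1 for u, v in edges if (module_of[u] == i) != (module_of[v] == i))
        # Term for the module node, whose parent is the root.
        entropy -= cut / total_volume * math.log2(module_volume / total_volume)
        # Terms for the leaf nodes, whose parent is the module.
        for v in module:
            entropy -= degree[v] / total_volume * math.log2(degree[v] / module_volume)
    return entropy


# Two triangles joined by a single bridge edge.
graph_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
flat_entropy = one_dimensional_entropy(graph_edges)
single_module_entropy = two_level_structural_entropy(graph_edges, [[0, 1, 2, 3, 4, 5]])
two_module_entropy = two_level_structural_entropy(graph_edges, [[0, 1, 2], [3, 4, 5]])
```

Grouping each triangle into its own module yields a lower entropy than leaving the graph unpartitioned, which illustrates how a low-entropy coding tree reflects the graph's community structure.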

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Theorem 1