Learning Hierarchical Knowledge in Text-Rich Networks with Taxonomy-Informed Representation Learning

Yunhui Liu, Yongchao Liu, Yinfeng Chen, Chuntao Hong, Tao Zheng, Tieke He

TL;DR

TIER first constructs an implicit hierarchical taxonomy and integrates it into the learned node representations, using a cophenetic correlation coefficient-based regularization loss to align the learned embeddings with the hierarchical structure.

Abstract

Hierarchical knowledge structures are ubiquitous across real-world domains and play a vital role in organizing information from coarse to fine semantic levels. While such structures have been widely used in taxonomy systems, biomedical ontologies, and retrieval-augmented generation, their potential remains underexplored in the context of Text-Rich Networks (TRNs), where each node contains rich textual content and edges encode semantic relationships. Existing methods for learning on TRNs often focus on flat semantic modeling, overlooking the inherent hierarchical semantics embedded in textual documents. To this end, we propose TIER (Hierarchical Taxonomy-Informed Representation Learning on Text-Rich Networks), which first constructs an implicit hierarchical taxonomy and then integrates it into the learned node representations. Specifically, TIER employs similarity-guided contrastive learning to build a clustering-friendly embedding space, upon which it performs hierarchical K-Means followed by LLM-powered clustering refinement to enable semantically coherent taxonomy construction. Leveraging the resulting taxonomy, TIER introduces a cophenetic correlation coefficient-based regularization loss to align the learned embeddings with the hierarchical structure. By learning representations that respect both fine-grained and coarse-grained semantics, TIER enables more interpretable and structured modeling of real-world TRNs. We demonstrate that our approach significantly outperforms existing methods on multiple datasets across diverse domains, highlighting the importance of hierarchical knowledge learning for TRNs.
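To make the cophenetic correlation idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each item carries a hypothetical `paths` entry (its cluster ids from coarsest to finest level), takes the tree distance between two items to be the number of levels below their last shared cluster, and measures the Pearson correlation between those tree distances and Euclidean embedding distances. The names `tree_distances` and `cophenetic_correlation` are illustrative, and using `1 - correlation` as a regularizer is an assumption.

```python
import numpy as np


def tree_distances(paths):
    """Pairwise tree distances from per-item cluster paths.

    paths[i] lists the cluster ids of item i from the coarsest to the
    finest level (an assumed encoding of the taxonomy).
    """
    paths = np.asarray(paths)
    depth = paths.shape[1]
    # same[i, j, l] is True when items i and j share a cluster at level l.
    same = paths[:, None, :] == paths[None, :, :]
    # The running product stays 1 only while every level so far matches,
    # so its sum is the length of the shared path prefix.
    shared_prefix = np.cumprod(same, axis=2).sum(axis=2)
    return depth - shared_prefix


def cophenetic_correlation(embeddings, paths):
    """Pearson correlation between embedding and tree distances."""
    x = np.asarray(embeddings, dtype=float)
    emb_dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    upper = np.triu_indices(len(x), k=1)  # each unordered pair once
    return np.corrcoef(emb_dist[upper], tree_distances(paths)[upper])[0, 1]


# Four points forming two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
aligned_paths = [(0, 0), (0, 1), (1, 2), (1, 3)]     # taxonomy matches geometry
misaligned_paths = [(0, 0), (1, 2), (0, 1), (1, 3)]  # taxonomy contradicts it

aligned = cophenetic_correlation(points, aligned_paths)
misaligned = cophenetic_correlation(points, misaligned_paths)
loss_aligned = 1.0 - aligned  # assumed form of a regularization term
```

A correlation near 1 means that items close in the taxonomy are also close in the embedding space, so a regularizer of this kind would push the embeddings toward the block structure the taxonomy implies.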

Paper Structure (30 sections, 1 theorem, 5 equations, 12 figures, 10 tables)


Key Result

Theorem 1

Given a graph with edge homophily $h > 0.5$, the constructed similarity matrix $\boldsymbol{S}$ more closely approximates the ideal matrix $\boldsymbol{S}^*$ than both classic contrastive learning (SimCLR) and SupCon. Consequently, minimizing the similarity-guided contrastive loss Eq.
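The theorem's precondition can be checked directly on a labeled graph. The sketch below uses the standard definition of edge homophily (the fraction of edges whose endpoints share a label); the function name `edge_homophily` and the toy graph are illustrative assumptions, not taken from the paper.

```python
def edge_homophily(edges, labels):
    """Fraction of edges that connect two nodes with the same label."""
    if not edges:
        raise ValueError("edge homophily is undefined for a graph with no edges")
    same_label = sum(1 for u, v in edges if labels[u] == labels[v])
    return same_label / len(edges)


# Toy graph: nodes 0 and 1 have label 0, nodes 2 and 3 have label 1.
toy_labels = [0, 0, 1, 1]
toy_edges = [(0, 1), (2, 3), (1, 2)]

h = edge_homophily(toy_edges, toy_labels)
satisfies_condition = h > 0.5  # the regime the theorem addresses
```

Here two of the three edges are intra-class, so the toy graph falls in the $h > 0.5$ regime.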

Figures (12)

  • Figure 1: An example taxonomy of computer science papers.
  • Figure 2: The framework of TIER.
  • Figure 3: Visualizations of the learned node representations (colored by ground-truth labels) and the pairwise distance matrix between finest-level cluster centroids on Citeseer, with and without taxonomy regularization. The constructed taxonomy tree has 3 hierarchical levels with 1, 6, and 62 cluster nodes from top to bottom, respectively. Darker colors indicate smaller distances. With taxonomy regularization, clearer block structures emerge, where darker diagonal blocks correspond to coarser-grained clusters, reflecting improved semantic hierarchy in learned representation space.
  • Figure 4: RadialMap of the constructed taxonomy on Citeseer.
  • Figure 5: How the accuracy varies with different values of $\lambda$.
  • ...and 7 more figures

Theorems & Definitions (1)

  • Theorem 1