Taxonomy Tree Generation from Citation Graph
Yuntong Hu, Zhuofeng Li, Zheng Zhang, Chen Ling, Raasikh Kanjiani, Boxin Zhao, Liang Zhao
TL;DR
The paper tackles automatic taxonomy generation from citation graphs to organize scientific knowledge and support literature reviews. It introduces HiGTL (Hierarchical Graph Taxonomy Learning), an end-to-end framework that jointly learns hierarchical clustering of papers and verbalizes taxonomy nodes through iterative graph-to-text generation guided by user prompts; the overall mapping is written as an explicit decomposition $f = h\circ g$. Key contributions include a hierarchical clustering module with CLU and AGG operators and a hierarchical contrastive loss $\mathcal{L}_{\text{HiMulCon}}$, an LLM-driven hierarchical taxonomy node verbalization objective $\mathcal{L}_{\text{Gen}}$, and a two-phase optimization that combines pretraining with LoRA fine-tuning. Experiments on 518 citation graphs from computer science literature reviews report state-of-the-art taxonomy quality (e.g., Coverage $=0.9357$, Structure $=0.9413$, BERTScore $=0.8694$) and stronger taxonomy-guided literature review generation (HiReview) than baselines, supporting the framework’s usefulness for knowledge discovery, trend identification, and scalable literature synthesis.
Abstract
Constructing taxonomies from citation graphs is essential for organizing scientific knowledge, facilitating literature reviews, and identifying emerging research trends. However, manual taxonomy construction is labor-intensive, time-consuming, and prone to human biases, often overlooking pivotal but less-cited papers. In this paper, to enable automatic hierarchical taxonomy generation from citation graphs, we propose HiGTL (Hierarchical Graph Taxonomy Learning), a novel end-to-end framework guided by human-provided instructions or preferred topics. Specifically, we propose a hierarchical citation graph clustering method that recursively groups related papers based on both textual content and citation structure, ensuring semantically meaningful and structurally coherent clusters. Additionally, we develop a novel taxonomy node verbalization strategy that iteratively generates central concepts for each cluster, leveraging a pre-trained large language model (LLM) to maintain semantic consistency across hierarchical levels. To further enhance performance, we design a joint optimization framework that fine-tunes both the clustering and concept generation modules, aligning structural accuracy with the quality of generated taxonomies. Extensive experiments demonstrate that HiGTL effectively produces coherent, high-quality taxonomies.
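The recursive grouping and per-cluster concept generation described above can be illustrated with a small toy sketch. It is not the HiGTL implementation: the learned clustering is replaced by a thresholded mix of Jaccard token overlap and a citation indicator, the LLM verbalizer by a most-frequent-token heuristic, and every name and parameter (`cluster_level`, `aggregate`, `verbalize`, `alpha`, `threshold`, `decay`) is hypothetical. Reading `cluster_level` plus `aggregate` as the clustering stage and `verbalize` as the verbalization stage of $f = h\circ g$ is likewise an assumption.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Token-set overlap; a stand-in for a learned text similarity."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_level(nodes, edges, alpha, threshold):
    """Link node pairs whose mixed text/citation score passes the threshold
    and return the connected components as clusters (lists of node ids)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for u, v in combinations(sorted(nodes), 2):
        linked = 1.0 if (u, v) in edges or (v, u) in edges else 0.0
        score = alpha * jaccard(nodes[u], nodes[v]) + (1 - alpha) * linked
        if score >= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for n in sorted(nodes):
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


def aggregate(clusters, nodes, edges):
    """Collapse each cluster into one coarser node: pool the member tokens
    and keep only the citation edges that cross cluster boundaries."""
    new_nodes, owner = {}, {}
    for members in clusters:
        cid = tuple(sorted(leaf for m in members for leaf in m))
        new_nodes[cid] = set().union(*(nodes[m] for m in members))
        for m in members:
            owner[m] = cid
    new_edges = {(owner[u], owner[v]) for u, v in edges if owner[u] != owner[v]}
    return new_nodes, new_edges


def build_taxonomy(papers, citations, alpha=0.6, threshold=0.5, decay=0.7, max_levels=5):
    """Cluster and aggregate repeatedly; the threshold is relaxed at each
    level (an assumption of this sketch) so coarser groups can still merge."""
    nodes = {(pid,): set(text.lower().split()) for pid, text in papers.items()}
    edges = {((u,), (v,)) for u, v in citations}
    levels = []
    for depth in range(max_levels):
        if len(nodes) <= 1:
            break
        clusters = cluster_level(nodes, edges, alpha, threshold * decay ** depth)
        if len(clusters) == len(nodes):
            break  # nothing merged at this level
        nodes, edges = aggregate(clusters, nodes, edges)
        levels.append(sorted(nodes))
    return levels


def verbalize(cluster, papers):
    """Placeholder for LLM verbalization: the token shared by the most
    member papers, with ties broken alphabetically."""
    counts = Counter(t for pid in cluster for t in set(papers[pid].lower().split()))
    top = max(counts.values())
    return sorted(t for t, c in counts.items() if c == top)[0]


papers = {
    "a": "graph neural networks",
    "b": "graph neural clustering",
    "c": "language model prompting",
    "d": "language model tuning",
}
citations = [("b", "a"), ("d", "c"), ("c", "a")]

levels = build_taxonomy(papers, citations)
# levels[0] -> [("a", "b"), ("c", "d")]; levels[1] -> [("a", "b", "c", "d")]
labels = {cluster: verbalize(cluster, papers) for cluster in levels[0]}
```

In this toy run the two topical pairs merge first because they share both tokens and a citation, and the single cross-topic citation only joins them at the next, more permissive level. The abstract describes learned and jointly fine-tuned modules in place of these heuristics, so the sketch conveys only the control flow of recursive clustering followed by naming.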
