Are Large Language Models a Good Replacement of Taxonomies?
Yushi Sun, Hao Xin, Kai Sun, Yifan Ethan Xu, Xiao Yang, Xin Luna Dong, Nan Tang, Lei Chen
TL;DR
This paper introduces TaxoGlimpse, a taxonomy-focused benchmark designed to evaluate how well large language models (LLMs) capture hierarchical Is-A relations across domains ranging from common to specialized. The authors test 18 models on 10 taxonomies under zero-shot, few-shot, and Chain-of-Thought prompting, using two metrics, A (accuracy) and M (miss rate). The results show a consistent root-to-leaf decline in performance, with models struggling on leaf-level entities in specialized taxonomies. The study further shows that domain-specific instruction tuning yields the most reliable gains, that simple scaling or domain-agnostic tuning offers limited improvements, and that prompting settings mainly reduce misses rather than raise accuracy. The authors argue for a hybrid neural-symbolic taxonomy that combines the implicit knowledge of LLMs with explicit tree structures, and they present a case study on the Amazon Product Category taxonomy that suggests substantial maintenance-cost savings with acceptable precision and recall. Overall, TaxoGlimpse provides a rigorous, scalable framework for evaluating taxonomy knowledge in LLMs and highlights practical paths for ontology learning and taxonomy design in real-world applications.
Abstract
Large language models (LLMs) demonstrate an impressive ability to internalize knowledge and answer natural language questions. Although previous studies show that LLMs perform well on general knowledge but poorly on long-tail, nuanced knowledge, the community remains uncertain whether traditional knowledge graphs should be replaced by LLMs. In this paper, we ask whether the schema of a knowledge graph (i.e., its taxonomy) is made obsolete by LLMs. Intuitively, LLMs should perform well on common taxonomies and at taxonomy levels that are familiar to people. Unfortunately, no comprehensive benchmark evaluates LLMs over a wide range of taxonomies, from common to specialized domains and at levels from root to leaf, so that a confident conclusion can be drawn. To narrow this research gap, we constructed a novel benchmark for taxonomy hierarchical structure discovery, named TaxoGlimpse, to evaluate the performance of LLMs on taxonomies. TaxoGlimpse covers ten representative taxonomies from common to specialized domains, with in-depth experiments on entities at different levels of each taxonomy, from root to leaf. Our comprehensive experiments with eighteen state-of-the-art LLMs under three prompting settings show that LLMs still cannot adequately capture the knowledge of specialized taxonomies and leaf-level entities.
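To make the two metrics named in the TL;DR concrete, here is a minimal sketch of how accuracy and miss rate could be computed over yes/no Is-A questions. It is an illustration under stated assumptions, not the benchmark's actual code: the names `IsAQuestion`, `normalize`, and `score` are hypothetical, the example questions are invented, and it assumes that a "miss" is any response that is neither a clear "yes" nor a clear "no" (for example, "I don't know").

```python
from dataclasses import dataclass


@dataclass
class IsAQuestion:
    """Hypothetical record for one question: is `child` a kind of `parent`?"""
    child: str
    parent: str
    label: str  # gold answer, either "yes" or "no"


def normalize(answer: str) -> str:
    """Map a free-text model response to "yes", "no", or "unknown".

    Assumption: only the first word decides the answer; anything else
    (e.g. "I don't know", "Not sure") is treated as a miss.
    """
    tokens = answer.lower().replace(",", " ").replace(".", " ").split()
    first = tokens[0] if tokens else ""
    return first if first in ("yes", "no") else "unknown"


def score(questions, answers):
    """Return (accuracy, miss_rate), both as fractions of all questions."""
    if len(questions) != len(answers):
        raise ValueError("questions and answers must have the same length")
    total = len(questions)
    if total == 0:
        return 0.0, 0.0
    predictions = [normalize(a) for a in answers]
    correct = sum(p == q.label for p, q in zip(predictions, questions))
    missed = sum(p == "unknown" for p in predictions)
    return correct / total, missed / total


# Invented examples for illustration only; not taken from the benchmark.
questions = [
    IsAQuestion("Hiking boots", "Footwear", "yes"),
    IsAQuestion("Hiking boots", "Kitchen appliances", "no"),
    IsAQuestion("Trail gaiters", "Footwear accessories", "yes"),
    IsAQuestion("Trail gaiters", "Laptops", "no"),
]
answers = ["Yes.", "No", "I don't know", "Yes, it is."]

accuracy, miss_rate = score(questions, answers)
print(f"A = {accuracy:.2f}, M = {miss_rate:.2f}")
```

Reporting misses separately from wrong answers distinguishes a model that abstains from one that answers incorrectly, which is relevant to the observation that prompting settings mainly reduce misses rather than raise accuracy.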
