Are Large Language Models a Good Replacement of Taxonomies?

Yushi Sun; Hao Xin; Kai Sun; Yifan Ethan Xu; Xiao Yang; Xin Luna Dong; Nan Tang; Lei Chen

TL;DR

This paper introduces TaxoGlimpse, a taxonomy-focused benchmark designed to evaluate how well large language models capture hierarchical Is-A relations across a spectrum of domains from common to specialized. By testing 18 models across 10 taxonomies with zero-shot, few-shot, and Chain-of-Thought prompting, and scoring them with two metrics, accuracy ($A$) and miss rate ($M$), it reveals a consistent root-to-leaf decline in performance, with the weakest results on leaf-level entities in specialized taxonomies. The study further shows that domain-specific instruction tuning yields the most reliable gains, while simple scaling or domain-agnostic tuning offers limited improvements, and that prompting settings mainly reduce misses rather than raise accuracy. The authors argue for a hybrid neural-symbolic taxonomy that combines LLM-based implicit knowledge with explicit tree structures, and they present a case study on the Amazon Product Category that suggests substantial maintenance-cost savings with acceptable precision and recall. Overall, TaxoGlimpse provides a rigorous, scalable framework for evaluating taxonomy knowledge in LLMs and highlights practical paths for ontology learning and taxonomy design in real-world applications.
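To make the two metrics concrete, the following is a minimal sketch of how accuracy ($A$) and miss rate ($M$) could be computed. It rests on assumptions not stated in this summary: that each question has a yes/no gold answer, that a model may abstain with an "I don't know" response (encoded here as `"idk"`), and that both metrics are fractions of all questions asked. The function name `score` and the answer encoding are hypothetical.

```python
def score(responses, gold):
    """Return accuracy A and miss rate M for one question set.

    Assumed encoding (not taken from the paper): each gold answer is
    "yes" or "no"; each response is "yes", "no", or "idk" (abstention).
    """
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        raise ValueError("at least one question is required")

    total = len(gold)
    # A response counts as correct only if it matches the gold answer exactly.
    correct = sum(r == g for r, g in zip(responses, gold))
    # An abstention is a miss: it is never correct, and it is tracked separately.
    missed = sum(r == "idk" for r in responses)
    return {"A": correct / total, "M": missed / total}


example = score(["yes", "idk", "no", "no"], ["yes", "yes", "no", "yes"])
print(example)
```

Under this reading, a prompting setting that turns abstentions into wrong answers would lower $M$ without raising $A$, which matches the kind of effect the summary describes.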

Abstract

Large language models (LLMs) demonstrate an impressive ability to internalize knowledge and answer natural language questions. Although previous studies validate that LLMs perform well on general knowledge but poorly on long-tail nuanced knowledge, the community is still doubtful about whether traditional knowledge graphs should be replaced by LLMs. In this paper, we ask whether the schema of a knowledge graph (i.e., its taxonomy) is made obsolete by LLMs. Intuitively, LLMs should perform well on common taxonomies and at taxonomy levels that are familiar to people. Unfortunately, no comprehensive benchmark yet evaluates LLMs over a wide range of taxonomies, from common to specialized domains and at levels from root to leaf, so that we can draw a confident conclusion. To narrow this research gap, we constructed a novel taxonomy hierarchical structure discovery benchmark named TaxoGlimpse to evaluate the performance of LLMs over taxonomies. TaxoGlimpse covers ten representative taxonomies from common to specialized domains, with in-depth experiments on entities at different levels of each taxonomy from root to leaf. Our comprehensive experiments on eighteen state-of-the-art LLMs under three prompting settings validate that LLMs still cannot capture well the knowledge of specialized taxonomies and leaf-level entities.
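As an illustration of what probing a taxonomy "from root to leaf" can look like, here is a small sketch that turns a child-to-parent map into Is-A questions tagged with the child's depth. The toy taxonomy, the question wording, and the names `PARENT`, `depth`, and `is_a_questions` are all assumptions made for this example; they are not the benchmark's actual data or templates.

```python
# Toy child -> parent map; entity names are illustrative only.
PARENT = {
    "Electronics": None,  # root
    "Cameras": "Electronics",
    "Digital Cameras": "Cameras",
    "Mirrorless Cameras": "Digital Cameras",  # leaf
}


def depth(node):
    """Number of edges between a node and the root."""
    steps = 0
    while PARENT[node] is not None:
        node = PARENT[node]
        steps += 1
    return steps


def is_a_questions():
    """Build one (depth, question) pair per child-parent edge, shallowest first."""
    questions = []
    for child, parent in PARENT.items():
        if parent is None:
            continue  # the root has no parent to ask about
        text = (
            f"Is {child} a type of {parent}? "
            "Answer yes, no, or I don't know."
        )
        questions.append((depth(child), text))
    return sorted(questions)


for level, question in is_a_questions():
    print(level, question)
```

Grouping model answers by the depth tag is one simple way to compare performance near the root with performance at the leaves.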

Paper Structure (23 sections, 7 figures, 7 tables)

Introduction
Benchmark Construction and Question Design
Benchmark Construction
Question Design
Experimental settings
Large Language Models
Implementation Details
Metrics
Experimental results
How reliable are LLMs for discovering hierarchical structures in different taxonomies?
Do LLMs perform equally well among different levels of taxonomies?
Do normal methods that improve LLMs increase the accuracy?
Do different prompting settings influence the performance?
Instance Typing
Discussion
...and 8 more sections

Figures (7)

Figure 1: Exemplar chain of entities snippets of ten taxonomies. From top to bottom, we list the taxonomy snippets from common domains to specialized domains. From left to right, we present entities from the root to leaf levels.
Figure 2: The popularity of different taxonomies.
Figure 3: Accuracies for different levels of questions in hard datasets of different taxonomies under the zero-shot setting.
Figure 4: Radar charts for representative LLMs under different prompting settings in hard datasets.
Figure 5: Few-shot and Chain-of-Thoughts examples.
...and 2 more figures

Theorems & Definitions (1)

Example 1
