A Unified Taxonomy-Guided Instruction Tuning Framework for Entity Set Expansion and Taxonomy Expansion

Yanzhen Shen, Yu Zhang, Yunyi Zhang, Jiawei Han

TL;DR

TaxoInstruct is a unified taxonomy-guided instruction-tuning framework that addresses entity set expansion, taxonomy expansion, and seed-guided taxonomy construction by teaching an LLM two skills: finding siblings and finding parents. It uses a two-stage approach: large-scale pre-training on a real taxonomy (CTD MEDIC) to instill these skills, followed by domain-specific fine-tuning with structured prompts. Across six benchmark datasets, TaxoInstruct consistently outperforms task-specific baselines and remains robust across multiple LLM backbones, supporting the case for a unified, instruction-tuned strategy for taxonomy enrichment.

Abstract

Entity set expansion, taxonomy expansion, and seed-guided taxonomy construction are three representative tasks that can be applied to automatically populate an existing taxonomy with emerging concepts. Previous studies view them as three separate tasks. Therefore, their proposed techniques usually work for one specific task only, lacking generalizability and a holistic perspective. In this paper, we aim at a unified solution to the three tasks. To be specific, we identify two common skills needed for entity set expansion, taxonomy expansion, and seed-guided taxonomy construction: finding "siblings" and finding "parents". We propose a taxonomy-guided instruction tuning framework to teach a large language model to generate siblings and parents for query entities, where the joint pre-training process facilitates the mutual enhancement of the two skills. Extensive experiments on multiple benchmark datasets demonstrate the efficacy of our proposed TaxoInstruct framework, which outperforms task-specific baselines across all three tasks.
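
To make the two skills concrete, the following is a minimal sketch of how sibling-finding and parent-finding instruction examples might be derived from an existing taxonomy. It is an illustration only: the toy taxonomy entries, the prompt wording, and the helper names (`PARENT_OF`, `siblings`, `parent_example`, `sibling_example`, `build_examples`) are all assumptions and are not taken from the paper.

```python
from typing import Dict, List

# Toy taxonomy stored as child -> parent.
# Hypothetical entries for illustration; not the paper's training data.
PARENT_OF: Dict[str, str] = {
    "influenza": "viral disease",
    "measles": "viral disease",
    "tuberculosis": "bacterial disease",
    "viral disease": "infectious disease",
    "bacterial disease": "infectious disease",
}


def siblings(entity: str) -> List[str]:
    """Return the entities that share a parent with `entity`, sorted by name."""
    parent = PARENT_OF.get(entity)
    if parent is None:
        return []
    return sorted(e for e, p in PARENT_OF.items() if p == parent and e != entity)


def parent_example(entity: str) -> Dict[str, str]:
    """Build one parent-finding example (the prompt wording is an assumption)."""
    return {
        "instruction": "Name the parent concept of the query entity.",
        "input": entity,
        "output": PARENT_OF[entity],
    }


def sibling_example(entity: str) -> Dict[str, str]:
    """Build one sibling-finding example (the prompt wording is an assumption)."""
    return {
        "instruction": "List entities that share a parent with the query entity.",
        "input": entity,
        "output": ", ".join(siblings(entity)),
    }


def build_examples() -> List[Dict[str, str]]:
    """Create both kinds of examples so one model can be tuned on them jointly."""
    examples: List[Dict[str, str]] = []
    for entity in sorted(PARENT_OF):
        examples.append(parent_example(entity))
        if siblings(entity):  # skip entities that have no siblings
            examples.append(sibling_example(entity))
    return examples
```

Mixing both example types into a single training set mirrors the joint setup the abstract describes, in which the sibling and parent skills are learned together rather than by separate task-specific models.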

Paper Structure

This paper contains 24 sections, 8 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Illustrations of the three tasks.
  • Figure 2: Illustration of the TaxoInstruct framework.

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3