From Retrieval to Generation: Efficient and Effective Entity Set Expansion

Shulin Huang; Shirong Ma; Yangning Li; Yinghui Li; Hai-Tao Zheng

TL;DR

This paper addresses the inefficiency of retrieval-based entity set expansion (ESE) by proposing GenExpan, a corpus-independent generative framework built on a single pre-trained autoregressive language model. GenExpan guides generation with class names produced via in-context learning (Class Name Generation), restricts output to valid entities through prefix-constrained decoding over a prefix tree, and improves ranking with Knowledge Calibration and Generative Ranking. Its expansion time is independent of corpus size and entity vocabulary, which yields substantial speedups over retrieval-based baselines, and it delivers strong expansion quality on four benchmark datasets. In practice, this enables scalable ESE without large-scale corpus processing, and it points to future work on applying generative expansion to broader semantic classes and on prompt design.

Abstract

Entity Set Expansion (ESE) is a critical task that aims to expand the set of entities belonging to the target semantic class described by a few seed entities. Most existing ESE methods are retrieval-based frameworks that need to extract contextual features of entities and calculate the similarity between seed entities and candidate entities. To achieve these two purposes, they iteratively traverse the corpus and the entity vocabulary, resulting in poor efficiency and scalability. Experimental results indicate that the time consumed by retrieval-based ESE methods increases linearly with entity vocabulary and corpus size. In this paper, we are the first to propose a Generative Entity Set Expansion (GenExpan) framework, which utilizes a generative pre-trained auto-regressive language model to accomplish the ESE task. Specifically, a prefix tree is employed to guarantee the validity of entity generation, and automatically generated class names are adopted to guide the model to generate target entities. Moreover, we propose Knowledge Calibration and Generative Ranking to further bridge the gap between the generic knowledge of the language model and the goal of the ESE task. In terms of efficiency, the expansion time consumed by GenExpan is independent of entity vocabulary and corpus size, and GenExpan achieves an average 600% speedup compared to strong baselines. In terms of expansion effectiveness, our framework outperforms previous state-of-the-art ESE methods.
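To make the prefix-tree idea in the abstract concrete, the following is a minimal sketch of prefix-constrained decoding, not the authors' implementation. It assumes whitespace-level tokens instead of the language model's subword tokens, uses a hypothetical `END` marker to close an entity, and takes a caller-supplied `score` function as a stand-in for the model's next-token log-probabilities. The function names are illustrative only.

```python
from typing import Callable, Dict, List

END = "<end>"  # hypothetical end-of-entity marker (assumption, not from the paper)


def build_prefix_tree(entities: List[List[str]]) -> Dict:
    """Build a nested-dict prefix tree from tokenized entity names."""
    root: Dict = {}
    for tokens in entities:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = {}  # mark that a complete entity ends here
    return root


def allowed_next_tokens(tree: Dict, prefix: List[str]) -> List[str]:
    """Return the tokens that keep `prefix` on a path to a valid entity."""
    node = tree
    for tok in prefix:
        if tok not in node:
            return []  # prefix is not part of any known entity
        node = node[tok]
    return sorted(node.keys())


def constrained_greedy_decode(
    tree: Dict, score: Callable[[List[str], str], float]
) -> List[str]:
    """Greedily pick the best-scoring token among those the tree allows."""
    prefix: List[str] = []
    while True:
        candidates = allowed_next_tokens(tree, prefix)
        if not candidates:
            return prefix  # empty tree: nothing to generate
        best = max(candidates, key=lambda t: score(prefix, t))
        if best == END:
            return prefix
        prefix.append(best)


# Toy vocabulary taken from the Figure 3 caption.
tree = build_prefix_tree([["China"], ["Florida"], ["Florida", "State"]])
# Toy scorer standing in for language-model scores (assumption).
toy_score = lambda prefix, tok: {"Florida": 2.0, "State": 3.0, END: 1.0}.get(tok, 0.0)
entity = constrained_greedy_decode(tree, toy_score)
```

Because each decoding step only consults the children of the current tree node, the decoder can never emit a string outside the entity vocabulary, whatever scores the model assigns. A beam-search variant would keep several prefixes alive and apply the same `allowed_next_tokens` filter to each.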

Paper Structure (18 sections, 6 equations, 6 figures, 5 tables)

Introduction
Related Work
Entity Set Expansion Methods
Entity Set Expansion Resources
Methodology
Problem Formulation
Overview of Methodology
Class Name Generation
Prefix-constrained Entity Generation
Knowledge Calibration
Generative Ranking
Experiments
Experiment Setup
Experiment Results
Ablation Studies
...and 3 more sections

Figures (6)

Figure 1: Total elapsed time for retrieval-based and generative methods. We select the strong baseline CGExpan to represent retrieval-based methods. All experiments are run on one Nvidia RTX 3090 GPU. The two horizontal axes represent the number of candidate entities and the corpus size, i.e., the number of sentences in the corpus. The vertical axis represents the total elapsed time of the entire expansion process. We execute 40 queries, each expanding a minimum of 50 entities.
Figure 2: Overview of the GenExpan framework. We employ the same pre-trained language model throughout the whole expansion process.
Figure 3: Example of constrained entity decoding using "They are US States: Nevada, Texas, Ohio," as input. There are two cases: (a) outside entity generation and (b) inside entity generation. The model is supposed to generate valid entities from the entity prefix tree $T$ (e.g., "China", "Florida", "Florida State") based on the input.
Figure 4: Parameter sensitivity analysis of model size, beam size and ranking weight in GenExpan.
Figure 5: We select the China Provinces and Sportsleagues classes because they lead to poor performance in the no Ca(libration) and no Cl(ass Name) settings, as shown in Table \ref{tab:ablation}. We mark the wrong entities in red.
...and 1 more figures
