ConExion: Concept Extraction with Large Language Models
Ebrahim Norouzi, Sven Hertling, Harald Sack
TL;DR
ConExion uses pre-trained LLMs to extract the domain concepts that are present in scientific texts, aligning the extracted terms with the source document to support ontology evaluation. The framework relies on a modular prompt design (zero-shot and few-shot variants) and a confidence metric derived from token-probability histories to rank concepts. In experiments on Inspec and SemEval2017, ConExion achieves higher F1 scores than several baselines, with the best results obtained using Few-shot 1-Random prompts on a large model (e.g., Llama3 70B), although its recall remains weaker than that of graph-based methods. The work emphasizes reproducibility and outlines future directions, including domain-annotated data, token-constrained generation, and instruction tuning, to improve recall and domain coverage in concept extraction.
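As a rough illustration of the two post-processing ideas mentioned above (keeping only concepts that occur in the source document, and ranking them by a confidence derived from token probabilities), the following minimal Python sketch may help. It is not the authors' implementation: the function names, the whole-word matching rule, and the choice of the geometric mean of token probabilities as the confidence score are all assumptions made for this example.

```python
import math
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores case and spacing."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def is_present(concept: str, document: str) -> bool:
    """True if the normalized concept occurs in the document as whole words."""
    pattern = r"(?<!\w)" + re.escape(normalize(concept)) + r"(?!\w)"
    return re.search(pattern, normalize(document)) is not None


def confidence(token_logprobs) -> float:
    """Geometric mean of token probabilities (an assumed scoring rule)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def rank_concepts(scored_candidates, document):
    """Filter candidates to those present in the document, then rank them.

    scored_candidates: iterable of (concept, token_logprobs) pairs, where
    token_logprobs are the log-probabilities of the generated tokens
    (hypothetical input format).
    """
    best = {}
    for concept, logprobs in scored_candidates:
        key = normalize(concept)
        if not key or not is_present(key, document):
            continue  # drop empty strings and concepts absent from the text
        score = confidence(logprobs)
        # Keep the highest score if the model produced a concept twice.
        best[key] = max(score, best.get(key, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    doc = "Graph neural networks improve  Keyphrase extraction."
    candidates = [
        ("keyphrase extraction", [-1.0, -1.0]),
        ("Graph Neural Networks", [-0.1, -0.2, -0.3]),
        ("ontology", [-0.05]),  # not in the document, so it is filtered out
    ]
    for concept, score in rank_concepts(candidates, doc):
        print(f"{score:.3f}  {concept}")
```

A length-normalized score such as the geometric mean avoids penalizing multi-word concepts merely for having more tokens; other aggregations (minimum or product of token probabilities) would be equally plausible under these assumptions.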
Abstract
In this paper, we present an approach for concept extraction from documents using pre-trained large language models (LLMs). Conventional methods extract keyphrases that summarize the important information discussed in a document. Our approach tackles the more challenging task of extracting all concepts present in the document that relate to the specific domain, not just the important ones. Through comprehensive evaluations on two widely used benchmark datasets, we demonstrate that our method improves the F1 score compared to state-of-the-art techniques. Additionally, we explore the potential of using prompts within these models for unsupervised concept extraction. The extracted concepts are intended to support the evaluation of ontology domain coverage and to facilitate ontology learning, and the results highlight the effectiveness of LLMs in concept extraction tasks. Our source code and datasets are publicly available at https://github.com/ISE-FIZKarlsruhe/concept_extraction.
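For readers unfamiliar with how F1 is typically computed for extraction tasks, here is a minimal sketch of set-based precision, recall, and F1 for a single document. It assumes exact matching after lowercasing and trimming; the paper's actual evaluation protocol (for example, whether stemming is applied) may differ, and the function name `prf1` is hypothetical.

```python
def prf1(predicted, gold):
    """Set-based precision, recall, and F1 with case-insensitive exact match."""
    pred = {p.lower().strip() for p in predicted}
    ref = {g.lower().strip() for g in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    predicted = ["neural network", "ontology", "graph"]
    gold = ["Neural Network", "ontology", "keyphrase", "concept"]
    p, r, f = prf1(predicted, gold)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a method can lead on F1 while still trailing on recall, which is the trade-off noted in the TL;DR with respect to graph-based methods.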
