Comparative Study of Domain Driven Terms Extraction Using Large Language Models

Sandeep Chataut, Tuyen Do, Bichar Dip Shrestha Gurung, Shiva Aryal, Anup Khanal, Carol Lushbough, Etienne Gnimpieba

TL;DR

This work compares three large language models (GPT-3.5, Llama2-7B, Falcon-7B) for keyword extraction on the Inspec and PubMed datasets, using a LangChain-based extraction framework and Jaccard similarity against the union of ground-truth keywords. It emphasizes prompt engineering, analyzes the effects of hallucination, and reports both quantitative (scores) and qualitative (inference time, word clouds) findings. GPT-3.5 achieves the highest overlap with the reference keywords, while Llama2-7B and Falcon-7B show varying degrees of term expansion and lower agreement, highlighting a trade-off between precision and coverage. The study offers practical guidance on model selection and prompt design for domain-driven keyword extraction and outlines future directions for tooling, benchmarking, and hallucination mitigation.

Abstract

Keywords play a crucial role in bridging the gap between human understanding and machine processing of textual data. They are essential to data enrichment because they form the basis for detailed annotations that provide a more insightful and in-depth view of the underlying data. Keyword (domain-driven term) extraction is a pivotal task in natural language processing, facilitating information retrieval, document summarization, and content categorization. This review focuses on keyword extraction methods, emphasizing the use of three major Large Language Models (LLMs): Llama2-7B, GPT-3.5, and Falcon-7B. We employed a custom Python package to interface with these LLMs, simplifying keyword extraction. Our study evaluates the performance of these models on the Inspec and PubMed datasets. The Jaccard similarity index was used for assessment, yielding scores of 0.64 (Inspec) and 0.21 (PubMed) for GPT-3.5, 0.40 and 0.17 for Llama2-7B, and 0.23 and 0.12 for Falcon-7B. This paper underlines the role of prompt engineering in improving LLM-based keyword extraction and discusses how hallucination in LLMs affects result evaluation. It also sheds light on the challenges of using LLMs for keyword extraction, including model complexity, resource demands, and optimization techniques.
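
As a rough illustration of the assessment metric named above, the Jaccard similarity between a set of extracted keywords and a set of reference keywords is the size of their intersection divided by the size of their union. The sketch below is not the authors' evaluation code: the function names, the lowercase/whitespace normalization, the handling of empty sets, and the sample keywords are all assumptions made for illustration.

```python
def normalize(keywords):
    """Lowercase and strip each keyword; drop empties (an assumed normalization)."""
    return {kw.strip().lower() for kw in keywords if kw.strip()}


def jaccard_similarity(predicted, reference):
    """Return |A ∩ B| / |A ∪ B| for two keyword collections."""
    pred, ref = normalize(predicted), normalize(reference)
    union = pred | ref
    if not union:
        # Both sets empty: returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return len(pred & ref) / len(union)


# Hypothetical keywords, not taken from Inspec or PubMed.
predicted = ["neural networks", "Keyword Extraction", "LLM"]
reference = ["keyword extraction", "neural networks", "information retrieval"]
score = jaccard_similarity(predicted, reference)  # 2 shared / 4 distinct terms
```

Because every extra term a model emits enlarges the union without enlarging the intersection, a model that expands or hallucinates terms is penalized under this metric even when it also recovers the reference keywords.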

Paper Structure

This paper contains 29 sections, 6 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Workflow of the LangChain framework
  • Figure 2: Comparison of Jaccard similarity scores
  • Figure 3: Keyword visualization with word clouds