Enhancing Software-Related Information Extraction via Single-Choice Question Answering with Large Language Models

Wolfgang Otto; Sharmila Upadhyaya; Stefan Dietze

TL;DR

This work addresses the Software Mentions Disambiguation (SOMD) shared task on scholarly texts by combining Retrieval-Augmented Generation with generative LLMs to perform NER and relation extraction. It reframes relation extraction as a single-choice question-answering task and uses in-context learning with task-specific retrieval strategies to improve information extraction accuracy. Experimental results show that LLM-based, retrieval-enhanced methods can come close to or match strong baselines, particularly for attributive NER and RE, while highlighting challenges such as hallucinations and span-matching. The study demonstrates a viable path for accurate, scalable software-mention analysis in scholarly literature, with implications for transparency and reproducibility and directions for future domain-specific IE research.

Abstract

This paper describes our participation in the Shared Task on Software Mentions Disambiguation (SOMD), with a focus on improving relation extraction in scholarly texts through generative Large Language Models (LLMs) using single-choice question-answering. The methodology prioritises the in-context learning capabilities of LLMs to extract software-related entities and their descriptive attributes, such as distributive information. Our approach uses Retrieval-Augmented Generation (RAG) techniques and LLMs for Named Entity Recognition (NER), Attributive NER, and the identification of relationships between extracted software entities, providing a structured solution for analysing software citations in academic literature. The paper provides a detailed description of our approach, demonstrating how using LLMs in a single-choice QA paradigm can enhance information extraction (IE) methodologies. Our participation in the SOMD shared task highlights the importance of precise software citation practices and showcases our system's ability to address the challenges of disambiguating and extracting relationships between software mentions. This lays the groundwork for future research and development in this field.
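To make the single-choice QA framing concrete, the following is a minimal sketch of how one relation decision could be posed as a question with enumerated options. It is not the authors' prompt or code: the `Mention` class, the function names, the entity labels, the question wording, and the example sentence are all assumptions introduced for illustration, and the model reply is a hard-coded stand-in rather than an actual LLM call.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mention:
    text: str   # surface form as it appears in the sentence
    label: str  # entity type; the labels used below are illustrative only


def build_single_choice_prompt(sentence: str, attribute: Mention,
                               candidates: List[Mention]) -> str:
    """Render one question whose options are the candidate software mentions."""
    lines = [
        f"Sentence: {sentence}",
        f'Question: Which software does the {attribute.label.lower()} '
        f'"{attribute.text}" refer to?',
    ]
    # Enumerate the candidates so the model only has to pick a number.
    for i, cand in enumerate(candidates, start=1):
        lines.append(f"({i}) {cand.text}")
    lines.append("Answer with the number of exactly one option.")
    return "\n".join(lines)


def parse_choice(answer: str, candidates: List[Mention]) -> Optional[Mention]:
    """Map the reply back to a candidate; return None for an invalid option."""
    match = re.search(r"\d+", answer)
    if match is None:
        return None
    index = int(match.group()) - 1
    return candidates[index] if 0 <= index < len(candidates) else None


# Hypothetical example sentence and mentions (not taken from the SOMD data).
sentence = "We analysed the data with SPSS 25 and plotted the results in R."
version = Mention("25", "Version")
software = [Mention("SPSS", "Software"), Mention("R", "Software")]

prompt = build_single_choice_prompt(sentence, version, software)
reply = "(1)"  # stand-in for a model response; no LLM is called here
chosen = parse_choice(reply, software)
```

Restricting the answer to one numbered option means an out-of-range or missing number can be rejected outright, which is one plausible way to limit the hallucination and span-matching problems noted in the TL;DR.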

Paper Structure (21 sections, 3 figures, 5 tables)

Introduction
Related Work
SOMD Shared Task
Using LLMs for Software Related IE-Tasks
Challenges in Applying LLMs to NER Tasks
Sample Retrieval for RAG on various IE-Tasks
Extraction of Software Entities
Extraction of Software Attributes
Relation Extraction as Single-Choice Question Answering Task
Experiments
Models
Prompting
Train Sample Retrieval for Few-Shot Generation
Relation Extraction Baseline
Results
...and 6 more sections

Figures (3)

Figure 1: Software NER Few-Shot prompt (n=2). The shown sample is the same as in the similarity examples in Tables \ref{table:sim_ent} and \ref{table:sim_sent}.
Figure 2: Attributive NER Few-Shot prompt (n=2).
Figure 3: Single-Choice QA prompt for Relation Annotation (n=2).
