Hidden Entity Detection from GitHub Leveraging Large Language Models
Lu Gan, Martin Blum, Danilo Dessi, Brigitte Mathiak, Ralf Schenkel, Stefan Dietze
TL;DR
This work tackles the automatic detection of dataset and software mentions that appear as URLs in GitHub READMEs, with the goal of supporting knowledge graph population. It examines the zero-shot and few-shot capabilities of two LLMs (LLaMA 2 and Mistral 7B), including quantized variants, using static and dynamic prompts on two tasks: Extraction and Classification (E+CL) and Classification (CL). A manually annotated gold-standard dataset of 811 repositories with 1439 URLs is created, and a post-processing pipeline with JSON formatting and URL matching is proposed. The findings reveal limited precision and recall for LLM-based extraction and classification compared with non-LLM baselines, highlighting the need for improved prompts, robust post-processing, and possibly hybrid approaches for reliable knowledge graph population from code-related sources.
Abstract
Named entity recognition is an important task when constructing knowledge bases from unstructured data sources. Whereas entity detection methods mostly rely on extensive training data, Large Language Models (LLMs) have paved the way towards approaches that rely on zero-shot learning (ZSL) or few-shot learning (FSL) by taking advantage of the capabilities LLMs acquired during pretraining. In very specialized scenarios where large-scale training data is not available, ZSL and FSL open new opportunities. This paper follows this recent trend and investigates the potential of leveraging LLMs in such scenarios to automatically detect datasets and software within textual content from GitHub repositories. While existing methods have focused solely on named entities, this study broadens the scope by incorporating resources such as repositories and online hubs, where entities are also represented by URLs. The study explores different FSL prompt learning approaches to enhance the LLMs' ability to identify dataset and software mentions within repository texts. Through analyses of LLM effectiveness and learning strategies, this paper offers insights into the potential of advanced language models for automated entity detection.
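
To make the idea of post-processing LLM output with JSON formatting and URL matching more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the model is asked to answer with a JSON array of objects with the keys `url` and `label`. The key names, the label set, the URL regular expression, and all function names are hypothetical choices made for illustration. The sketch parses the JSON portion of a free-form response and keeps only predictions whose URL actually occurs in the README text.

```python
import json
import re

# Hypothetical label set; the exact label names used in the paper are not given here.
ALLOWED_LABELS = {"dataset", "software"}

# Simple, assumed URL pattern: stops at whitespace, closing brackets, and quotes.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


def normalize_url(url: str) -> str:
    """Strip trailing punctuation and slashes, then lowercase, for lenient matching."""
    return url.strip().rstrip(".,;:").rstrip("/").lower()


def extract_urls(readme_text: str) -> set[str]:
    """Collect the normalized URLs that actually occur in the README."""
    return {normalize_url(u) for u in URL_PATTERN.findall(readme_text)}


def parse_llm_output(raw_output: str) -> list[dict]:
    """Parse the span between the first '[' and the last ']' as a JSON array."""
    start, end = raw_output.find("["), raw_output.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        parsed = json.loads(raw_output[start:end + 1])
    except json.JSONDecodeError:
        return []  # malformed JSON is treated as "no predictions"
    return [item for item in parsed if isinstance(item, dict)]


def post_process(raw_output: str, readme_text: str) -> list[dict]:
    """Keep only well-formed predictions whose URL occurs in the README."""
    known_urls = extract_urls(readme_text)
    kept = []
    for item in parse_llm_output(raw_output):
        url = normalize_url(str(item.get("url", "")))
        label = str(item.get("label", "")).lower()
        if url in known_urls and label in ALLOWED_LABELS:
            kept.append({"url": url, "label": label})
    return kept


if __name__ == "__main__":
    readme = "Data: https://zenodo.org/record/123. Code at https://github.com/org/tool/ (see docs)."
    response = (
        'Sure! [{"url": "https://zenodo.org/record/123", "label": "Dataset"}, '
        '{"url": "https://example.com/fake", "label": "software"}]'
    )
    print(post_process(response, readme))
```

Matching predicted URLs against those found in the source text is one simple way to discard URLs that a model invents. Stricter or looser normalization may be appropriate depending on how the gold-standard URLs are recorded.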
