NERdME: a Named Entity Recognition Dataset for Indexing Research Artifacts in Code Repositories

Genet Asefa Gesese, Zongxiong Chen, Shufan Jiang, Mary Ann Tan, Zhaotai Liu, Sonja Schimmler, Harald Sack

TL;DR

NERdME is a dataset of 200 manually annotated README files with over 10,000 labeled spans across 10 entity types. A downstream entity-linking experiment indicates that entities derived from READMEs can support artifact discovery and metadata integration.

Abstract

Existing scholarly information extraction (SIE) datasets focus on scientific papers and overlook implementation-level details in code repositories. README files describe datasets, source code, and other implementation-level artifacts; however, their free-form Markdown offers little semantic structure, making automatic information extraction difficult. To address this gap, NERdME is introduced: 200 manually annotated README files with over 10,000 labeled spans and 10 entity types. Baseline results using large language models and fine-tuned transformers show clear differences between paper-level and implementation-level entities, indicating the value of extending SIE benchmarks with entity types available in README files. A downstream entity-linking experiment was conducted to demonstrate that entities derived from READMEs can support artifact discovery and metadata integration.

Paper Structure (6 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Illustration of an annotated README file in the dataset. Note that unlabeled sentences are NOT negative samples.
  • Figure 2: Span statistics in NERdME. Bars show the proportion of spans across train, validation, and test splits for each entity type; numbers in parentheses indicate total spans per type.