SciDef: Automating Definition Extraction from Academic Literature with Large Language Models

Filip Kučera, Christoph Mandl, Isao Echizen, Radu Timofte, Timo Spinde

TL;DR

SciDef presents an LLM-driven pipeline for automated extraction of definitions from scientific literature and introduces two benchmarks, DefExtra and DefSim, to evaluate extraction quality and definitional similarity. Through extensive experiments over 16 models and multiple prompting strategies, the authors show that multi-step prompting and DSPy-optimized prompts yield higher extraction quality, with NLI-based similarity providing the most reliable evaluation. The work reports that LLMs can recover a large majority of ground-truth definitions (around 86%), yet over-generation and relevance filtering remain key challenges for real-world deployment. By releasing DefExtra, DefSim, and SciDef, the study lays groundwork for scalable definitional taxonomy construction, while acknowledging cost and domain limitations that motivate further research.

Abstract

Definitions are the foundation of any scientific work, but with a significant increase in publication numbers, gathering definitions relevant to any keyword has become challenging. We therefore introduce SciDef, an LLM-based pipeline for automated definition extraction. We test SciDef on DefExtra and DefSim, novel datasets of human-extracted definitions and of definition-pair similarity, respectively. Evaluating 16 language models across prompting strategies, we demonstrate that multi-step and DSPy-optimized prompting improve extraction performance. To evaluate extraction, we test various metrics and show that an NLI-based method yields the most reliable results. We show that LLMs are largely able to extract definitions from scientific literature (86.4% of definitions from our test set); yet future work should focus not just on finding definitions, but on identifying relevant ones, as models tend to over-generate them. Code and datasets are available at https://github.com/Media-Bias-Group/SciDef.
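
To make the distinction between recovering definitions and over-generating them concrete, the following is a minimal sketch of how extracted definitions could be scored against human-extracted ones with a thresholded similarity function. Everything here is an illustrative assumption rather than the authors' implementation: the function names, the `0.5` threshold, and in particular `token_jaccard`, which is a simple token-overlap stand-in for the NLI-based similarity the abstract reports as most reliable.

```python
from typing import Callable, Dict, List


def token_jaccard(a: str, b: str) -> float:
    """Stand-in similarity: Jaccard overlap of lowercase tokens.

    Assumption: this is NOT the NLI-based metric from the paper; it only
    keeps the sketch self-contained and dependency-free.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def evaluate_extraction(
    gold: List[str],
    predicted: List[str],
    similarity: Callable[[str, str], float] = token_jaccard,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> Dict[str, float]:
    """Score one paper's predicted definitions against its gold definitions."""
    # A gold definition counts as recovered if any prediction is similar enough.
    recovered = sum(
        1 for g in gold if any(similarity(g, p) >= threshold for p in predicted)
    )
    # A prediction counts as relevant if it matches at least one gold definition.
    relevant = sum(
        1 for p in predicted if any(similarity(g, p) >= threshold for g in gold)
    )
    return {
        "recall": recovered / len(gold) if gold else 0.0,
        "precision": relevant / len(predicted) if predicted else 0.0,
        "n_predicted": len(predicted),
    }


# Toy example: one gold definition, three predictions (two are spurious).
gold = ["media bias is slanted news coverage"]
predicted = [
    "media bias is slanted news coverage",
    "a corpus is a collection of texts",
    "we thank the reviewers",
]
result = evaluate_extraction(gold, predicted)
print(result)
```

In this toy case the single gold definition is recovered, so recall is perfect, while two of the three predictions match nothing in the gold set. That is the over-generation pattern the abstract describes: high coverage of ground-truth definitions alongside many extra, irrelevant outputs.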

Paper Structure (14 sections, 2 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Definition extraction workflow. Left: Datasets and metrics evaluated for our definition-similarity task. Right: Pool of LLMs evaluated on our datasets to pick the strongest prompt & model combination using the previously selected metric.
  • Figure 2: Top-10 extractor configurations by test score.
  • Figure 3: Best performing model for each metric across datasets, with GT threshold $0.90$ and model threshold set to maximize its performance.
  • Figure 4: Metric performance across datasets at a strict $0.95$ threshold.
  • Figure 5: Test score vs. average predicted definitions per paper.
  • ...and 1 more figure