Variable Extraction for Model Recovery in Scientific Literature

Chunwei Liu, Enrique Noriega-Atala, Adarsh Pyarelal, Clayton T Morrison, Mike Cafarella

TL;DR

LLM-based methods perform best at extracting model variables from scientific papers, demonstrating the potential of LLMs to enhance automatic comprehension of scientific artifacts and to support automatic model recovery and simulation.

Abstract

The global output of academic publications exceeds 5 million articles per year, making it difficult for humans to keep up with even a tiny fraction of scientific output. We need methods to navigate and interpret the artifacts (texts, graphs, charts, code, models, and datasets) that make up the literature. This paper evaluates various methods for extracting mathematical model variables from epidemiological studies, such as "infection rate ($α$)," "recovery rate ($γ$)," and "mortality rate ($μ$)." Variable extraction appears to be a basic task, but it plays a pivotal role in recovering models from scientific literature. Once extracted, these variables can be used for automatic mathematical modeling, simulation, and replication of published results. We introduce a benchmark dataset comprising manually annotated variable descriptions and variable values extracted from scientific papers. Using this dataset, we present several baseline methods for variable extraction based on Large Language Models (LLMs) and rule-based information extraction systems. Our analysis shows that LLM-based solutions perform best. Although combining rule-based extraction outputs with LLMs yields incremental benefits, the leap in performance attributed to the transfer-learning and instruction-tuning capabilities of LLMs themselves is far more significant. This investigation demonstrates the potential of LLMs to enhance automatic comprehension of scientific artifacts and to support automatic model recovery and simulation.

Paper Structure

This paper contains 21 sections, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Example of variable extraction from a scientific paper text, illustrating the process of identifying and extracting elements such as the variable name, description, and initial value into a structured format. The figure highlights different types of extraction: variable description pairs in light orange and variable value pairs in light purple.
  • Figure 2: Example of SciVar JSON output extracted and formatted from the annotated PDF text block in Figure 1.
  • Figure 3: Example of a pattern-matching rule designed to detect variable descriptions. The word "interpreted" anchors the pattern (line 8). Outgoing syntactic dependencies of types nmod_as and nsubjpass, pointing to entities of types Phrase and Identifier, link the rule's trigger to its description and variable arguments, respectively.
  • Figure 4: Prompt templates for variable extraction using various setups. The black font indicates the prompt template for a standard LLM. The combination of black and brown fonts represents the template for few-shot prompting. The integration of black and blue fonts denotes the template enhanced by external tools.
  • Figure 5: Palimpzest Code for Variable Extraction from Scientific Paper Snippets.
  • ...and 1 more figures
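
The structured output described in the captions of Figures 1 and 2 (a variable name, its description, and an optional value) can be pictured with a small sketch. This is only an illustration under stated assumptions: the `VariableRecord` field names, the regular expression, and the example sentence are hypothetical and are not the paper's SciVar schema, its rule system, or its LLM prompts. The sketch uses a toy pattern that recognizes phrases of the form "<word> rate (<symbol>)", optionally followed by "= <number>".

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class VariableRecord:
    # Field names are illustrative assumptions, not the paper's schema.
    name: str                      # the symbol, e.g. "α"
    description: str               # the phrase describing it, e.g. "infection rate"
    value: Optional[float] = None  # a numeric value, if one is stated


# Toy pattern: "<word> rate (<symbol>)" optionally followed by " = <number>".
PATTERN = re.compile(
    r"(?P<description>[a-z]+ rate) \((?P<name>[^()\s]+)\)"
    r"(?: = (?P<value>\d+(?:\.\d+)?))?"
)


def extract_variables(text: str) -> List[VariableRecord]:
    """Return one record per description/symbol pair found in the text."""
    records = []
    for match in PATTERN.finditer(text):
        raw_value = match.group("value")
        value = float(raw_value) if raw_value is not None else None
        records.append(
            VariableRecord(match.group("name"), match.group("description"), value)
        )
    return records


# Hypothetical sentence written for this example, not taken from the dataset.
snippet = (
    "We set the infection rate (α) = 0.3 and the recovery rate (γ) = 0.1; "
    "the mortality rate (μ) is estimated."
)
records = extract_variables(snippet)
print(json.dumps([asdict(r) for r in records], ensure_ascii=False, indent=2))
```

A hand-written pattern like this only covers one phrasing, which hints at why the abstract reports that rule-based extraction adds only incremental benefit on top of LLMs.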