Mining for Species, Locations, Habitats, and Ecosystems from Scientific Papers in Invasion Biology: A Large-Scale Exploratory Study with Large Language Models

Jennifer D'Souza, Zachary Laubach, Tarek Al Mustafa, Sina Zarrieß, Robert Frühstückl, Phyllis Illari

TL;DR

The paper investigates whether general-purpose LLMs can extract ecological entities (species, location, habitat, ecosystem) from invasion biology literature without domain-specific fine-tuning. It introduces a two-stage workflow, specialize and generalize, for discovering and merging extraction schemas, which enables scalable information extraction (IE) over a large corpus. The authors present a large text data mining corpus (>10,000 papers) and a public workflow for schema discovery, with corresponding code and data releases, and they report both the promise and the limitations of unsupervised LLM-based IE in ecology. The resources and methodology are intended to support systematic reviews, conservation planning, and ecological risk assessment.
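
As a rough illustration of the "generalize" stage, the sketch below merges several per-paper schemas into one shared schema. It is not the authors' implementation: the summary does not say how schemas are represented or merged, so the dictionary layout, the property names, the `generalize` function, and the `min_support` threshold are all hypothetical. The per-paper schemas, which an LLM would propose in the "specialize" stage, are hard-coded here.

```python
from collections import Counter

# Hypothetical per-paper schemas, standing in for the output of a
# "specialize" stage. Entity types follow the paper; property names are invented.
paper_schemas = [
    {"species": ["scientific_name", "common_name"], "location": ["country"]},
    {"species": ["scientific_name"], "habitat": ["type"],
     "location": ["country", "region"]},
    {"species": ["scientific_name", "status"], "ecosystem": ["type"]},
]


def generalize(schemas, min_support=1):
    """Merge per-paper schemas into one schema.

    A property is kept for an entity type if it appears in at least
    `min_support` of the input schemas (an assumed merging rule).
    """
    counts = {}
    for schema in schemas:
        for entity, props in schema.items():
            counter = counts.setdefault(entity, Counter())
            # Use a set so a property counts once per paper.
            counter.update(set(props))
    return {
        entity: sorted(p for p, n in counter.items() if n >= min_support)
        for entity, counter in counts.items()
    }


merged = generalize(paper_schemas)
strict = generalize(paper_schemas, min_support=2)
```

With `min_support=1` the merge is a plain union of properties per entity type. Raising the threshold keeps only properties that recur across papers, which is one simple way to trade coverage for consistency.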

Abstract

This paper presents an exploratory study that harnesses the capabilities of large language models (LLMs) to mine key ecological entities from invasion biology literature. Specifically, we focus on extracting species names, their locations, associated habitats, and ecosystems, information that is critical for understanding species spread, predicting future invasions, and informing conservation efforts. Traditional text mining approaches often struggle with the complexity of ecological terminology and the subtle linguistic patterns found in these texts. By applying general-purpose LLMs without domain-specific fine-tuning, we uncover both the promise and limitations of using these models for ecological entity extraction. In doing so, this study lays the groundwork for more advanced, automated knowledge extraction tools that can aid researchers and practitioners in understanding and managing biological invasions.

Paper Structure

This paper contains 8 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Distribution of papers in our corpus with abstracts and with full text over the past 20 years.
  • Figure 2: Distribution of papers in our corpus by abstract and full-text availability across the top ten publishers.