Mining for Species, Locations, Habitats, and Ecosystems from Scientific Papers in Invasion Biology: A Large-Scale Exploratory Study with Large Language Models
Jennifer D'Souza, Zachary Laubach, Tarek Al Mustafa, Sina Zarrieß, Robert Frühstückl, Phyllis Illari
TL;DR
The paper investigates using general-purpose LLMs to extract ecological entities (species, location, habitat, ecosystem) from invasion biology literature without domain-specific fine-tuning. It introduces a two-stage workflow (specialize and generalize) for discovering and merging extraction schemas, enabling scalable information extraction over a large corpus. It presents a text data mining corpus of more than 10,000 papers and a public schema-discovery workflow, with corresponding code and data releases, and highlights both the promise and the limitations of unsupervised LLM-based information extraction (IE) in ecology. The work advances automated knowledge extraction in invasion biology, offering resources and methodology to support systematic reviews, conservation planning, and ecological risk assessment.
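To make the idea of merging extraction schemas concrete, here is a minimal sketch of what a "generalize" step could look like. It is an illustration under stated assumptions, not the authors' implementation: the per-paper schemas, the `normalize` and `generalize` helpers, and the `min_support` threshold are all hypothetical, and the LLM calls that would produce the per-paper schemas in a "specialize" step are left out.

```python
from collections import Counter

# Hypothetical per-paper schemas, standing in for the output of a
# "specialize" step (one schema proposed per paper). Field names and
# descriptions are invented for illustration.
paper_schemas = [
    {"Species": "Scientific name of the invasive species",
     "location": "Geographic area of the study",
     "habitat": "Habitat type where the species was observed"},
    {"species": "Name of the focal species",
     "Location": "Country or region",
     "ecosystem": "Ecosystem affected by the invasion"},
    {"species": "Focal taxon",
     "habitat": "Habitat description",
     "study_year": "Year the field study took place"},
]


def normalize(name: str) -> str:
    """Map spelling variants of a field name to one canonical key."""
    return name.strip().lower().replace(" ", "_")


def generalize(schemas, min_support=2):
    """Merge per-paper schemas, keeping fields seen in >= min_support papers.

    This frequency-based merge is an assumed stand-in for the paper's
    generalize stage, which is not specified in this section.
    """
    counts = Counter()
    descriptions = {}
    for schema in schemas:
        seen = set()  # count each field at most once per paper
        for field, description in schema.items():
            key = normalize(field)
            if key in seen:
                continue
            seen.add(key)
            counts[key] += 1
            # Keep the first description encountered for each field.
            descriptions.setdefault(key, description)
    return {key: descriptions[key] for key in sorted(counts)
            if counts[key] >= min_support}


merged = generalize(paper_schemas)
for field, description in merged.items():
    print(f"{field}: {description}")
```

With these toy inputs, fields that occur in only one paper (`ecosystem`, `study_year`) fall below the threshold and are dropped. A real pipeline would need a more careful policy for rare but important fields.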
Abstract
This paper presents an exploratory study that harnesses the capabilities of large language models (LLMs) to mine key ecological entities from invasion biology literature. Specifically, we focus on extracting species names, their locations, associated habitats, and ecosystems, information that is critical for understanding species spread, predicting future invasions, and informing conservation efforts. Traditional text mining approaches often struggle with the complexity of ecological terminology and the subtle linguistic patterns found in these texts. By applying general-purpose LLMs without domain-specific fine-tuning, we uncover both the promise and limitations of using these models for ecological entity extraction. In doing so, this study lays the groundwork for more advanced, automated knowledge extraction tools that can aid researchers and practitioners in understanding and managing biological invasions.
