Quantification of Biodiversity from Historical Survey Text with LLM-based Best-Worst Scaling
Thomas Haider, Tobias Perschl, Malte Rehbein
TL;DR
The paper addresses the challenge of extracting quantitative biodiversity evidence from historical, semi-structured survey texts. It compares plain classification approaches with a continuous quantification regime based on Best-Worst Scaling (BWS) guided by Large Language Models (LLMs), evaluating Ministral-8B, DeepSeek-V3, and GPT-4. GPT-4 and DeepSeek-V3 achieve reasonable human–model agreement, and a transfer-learning regression setup using GPT-4 BWS scores with LaBSE features attains $R^2=0.73$ and $\mathrm{MAE}=0.11$ (with outputs scaled to $[0,1]$), illustrating a cost-effective path to fine-grained quantity estimation on historical data. The work demonstrates that BWS with LLMs can yield robust quantitative biodiversity metrics from historical texts and is potentially generalizable to other archival corpora, while acknowledging limitations from model biases and data availability.
Abstract
In this study, we evaluate methods to determine the frequency of species via quantity estimation from historical survey text. To that end, we formulate classification tasks and finally show that this problem can be adequately framed as a regression task using Best-Worst Scaling (BWS) with Large Language Models (LLMs). We test Ministral-8B, DeepSeek-V3, and GPT-4, finding that the latter two have reasonable agreement with humans and with each other. We conclude that this approach is more cost-effective than a fine-grained multi-class approach while being similarly robust, allowing automated quantity estimation across species.
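To illustrate how best-worst judgments can be turned into a continuous quantity score, the following is a minimal sketch of the standard BWS counting procedure. It assumes each judgment (from a human or an LLM) names the item in a tuple that expresses the highest quantity and the one that expresses the lowest. The function name `bws_scores`, the English quantity phrases, and the rescaling to $[0,1]$ via $(s+1)/2$ are illustrative assumptions, not necessarily the exact procedure used in the paper.

```python
from collections import defaultdict


def bws_scores(judgments):
    """Turn best-worst judgments into scores in [0, 1].

    judgments: iterable of (items, best, worst), where `items` is the tuple
    shown to the annotator and `best` / `worst` are the items judged to
    express the highest / lowest quantity. (Hypothetical data layout.)
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    seen = defaultdict(int)

    for items, b, w in judgments:
        if b not in items or w not in items or b == w:
            raise ValueError(f"invalid judgment for tuple {items!r}")
        for item in items:
            seen[item] += 1
        best[b] += 1
        worst[w] += 1

    scores = {}
    for item, n in seen.items():
        raw = (best[item] - worst[item]) / n  # counting score in [-1, 1]
        scores[item] = (raw + 1) / 2          # assumed rescaling to [0, 1]
    return scores


# Hypothetical quantity phrases; real inputs would be survey text snippets.
example = [
    (("abundant", "common", "rare", "absent"), "abundant", "absent"),
    (("abundant", "common", "rare", "occasional"), "abundant", "rare"),
    (("common", "rare", "absent", "occasional"), "common", "absent"),
]

for phrase, score in sorted(bws_scores(example).items(), key=lambda kv: -kv[1]):
    print(f"{phrase:12s} {score:.2f}")
```

Scores produced this way could then serve as regression targets for a model trained on sentence embeddings. An item that is never chosen as best or worst lands at 0.5, so more tuples per item generally give a finer-grained ranking.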
