Quantification of Biodiversity from Historical Survey Text with LLM-based Best-Worst Scaling

Thomas Haider, Tobias Perschl, Malte Rehbein

TL;DR

The paper addresses the challenge of extracting quantitative biodiversity evidence from historical, semi-structured survey texts. It compares plain classification approaches with continuous quantification based on Best-Worst Scaling (BWS) carried out by Large Language Models (LLMs), evaluating Ministral-8B, DeepSeek-V3, and GPT-4. GPT-4 and DeepSeek-V3 achieve reasonable human–model agreement. A transfer-learning regression setup using GPT-4 BWS with LaBSE features, with outputs scaled to $[0,1]$, attains $R^2 = 0.73$ and $\mathrm{MAE} = 0.11$, illustrating a cost-effective path to fine-grained quantity estimation on historical data. The work suggests that BWS with LLMs can yield robust quantitative biodiversity metrics from historical texts and is potentially generalizable to other archival corpora, while acknowledging limitations from model biases and data availability.

Abstract

In this study, we evaluate methods to determine the frequency of species via quantity estimation from historical survey text. To that end, we formulate classification tasks and finally show that this problem can be adequately framed as a regression task using Best-Worst Scaling (BWS) with Large Language Models (LLMs). We test Ministral-8B, DeepSeek-V3, and GPT-4, finding that the latter two have reasonable agreement with humans and each other. We conclude that this approach is more cost-effective and similarly robust compared to a fine-grained multi-class approach, allowing automated quantity estimation across species.
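
To make the BWS idea concrete, the following is a minimal sketch of how best/worst judgments over small tuples of texts can be turned into a continuous score per text. It assumes the standard counting estimator (times chosen best minus times chosen worst, divided by appearances) and a linear rescaling from $[-1,1]$ to $[0,1]$; the summary above does not state the exact scoring procedure used in the paper, so both are assumptions. The function name `bws_scores` and the item labels are hypothetical, and in practice the best/worst picks would come from an LLM or a human annotator.

```python
from collections import Counter


def bws_scores(judgments):
    """Turn Best-Worst Scaling judgments into scores in [0, 1].

    judgments: iterable of (items, best, worst), where `items` is the tuple
    shown to the annotator and `best`/`worst` are the two picks from it.
    Counting estimator (an assumption, not confirmed by the source):
        raw = (#best - #worst) / #appearances, which lies in [-1, 1].
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in judgments:
        if b not in items or w not in items or b == w:
            raise ValueError("best and worst must be distinct members of items")
        seen.update(items)  # every item in the tuple appeared once
        best[b] += 1
        worst[w] += 1
    raw = {item: (best[item] - worst[item]) / seen[item] for item in seen}
    # Linear rescaling to [0, 1] (assumed mapping).
    return {item: (score + 1) / 2 for item, score in raw.items()}


# Hypothetical judgments: A..D stand for four survey answers, "best" means
# the answer indicating the highest species quantity.
judgments = [
    (("A", "B", "C", "D"), "A", "D"),
    (("A", "B", "C", "D"), "A", "C"),
    (("A", "B", "C", "D"), "B", "D"),
]
scores = bws_scores(judgments)
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these three judgments the resulting ranking is A, B, C, D. Scores of this kind could then serve as regression targets for a model trained on text embeddings, which is the role the TL;DR attributes to the GPT-4 BWS output combined with LaBSE features.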

Paper Structure

This paper contains 14 sections, 1 equation, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Facsimile of a survey page, Freysing forestry office in the Upper Bavaria district.
  • Figure 2: Training curves of different models on incremental training data (binary classification).
  • Figure 3: Multi-class vs. regression distribution.
  • Figure 4: Density histogram of regressor prediction (top) and multi-class (bottom) distribution for Roe deer (SP_0015, red) and Eurasian otter (SP_0005, grey).