Assessing the Effectiveness of GPT-4o in Climate Change Evidence Synthesis and Systematic Assessments: Preliminary Insights

Elphin Tom Joe, Sai Dileep Koneru, Christine J Kirchhoff

TL;DR

This study evaluates GPT-4o as a tool for climate adaptation evidence synthesis using a GAMI-derived dataset focused on the food sector. It assesses three information-extraction tasks across escalating expertise levels—geographic location (low), stakeholders (medium), and depth of adaptation (high)—via a structured prompting strategy and intermediate verification. Results show high accuracy for low-level tasks but notable declines for medium and high-level tasks, revealing core challenges in domain-specific reasoning and taxonomy-driven classification. The findings emphasize that, while GPT-4o can augment expert workflows, robust, human-in-the-loop designs and task-specific data are needed to reliably support high-stakes climate adaptation assessments.

Abstract

In this research short, we examine the potential of using GPT-4o, a state-of-the-art large language model (LLM), to undertake evidence synthesis and systematic assessment tasks. Traditional workflows for such tasks involve large groups of domain experts who manually review and synthesize vast amounts of literature. The exponential growth of scientific literature and recent advances in LLMs provide an opportunity to complement these traditional workflows with new tools. We assess the efficacy of GPT-4o on these tasks using a sample from the dataset created by the Global Adaptation Mapping Initiative (GAMI), checking the accuracy of climate change adaptation feature extraction from the scientific literature across three levels of expertise. Our results indicate that while GPT-4o can achieve high accuracy in low-expertise tasks like geographic location identification, its performance in intermediate and high-expertise tasks, such as stakeholder identification and assessment of the depth of the adaptation response, is less reliable. The findings motivate the need for designing assessment workflows that utilize the strengths of models like GPT-4o while also providing refinements to improve their performance on these tasks.

Paper Structure (10 sections, 1 figure, 2 tables)