Assessing Large Language Models on Climate Information

Jannis Bulian, Mike S. Schäfer, Afra Amini, Heidi Lam, Massimiliano Ciaramita, Ben Gaiarin, Michelle Chen Hübscher, Christian Buck, Niels G. Mede, Markus Leippold, Nadine Strauß

TL;DR

Assessing climate information produced by LLMs is critical due to misinformation risks and the evolving nature of climate science. The authors present a science‑communication–informed evaluation framework and a large human study evaluating 300 climate questions across multiple LLMs, with a scalable AI-assisted oversight protocol. They find a persistent gap: surface presentational quality outpaces epistemological adequacy, though AI assistance improves rating fidelity. The work informs safer deployment of LLMs for climate information and identifies concrete directions to strengthen grounding, uncertainty communication, and multimodal evaluation.

Abstract

As Large Language Models (LLMs) rise in popularity, it is necessary to assess their capability in critically relevant domains. We present a comprehensive evaluation framework, grounded in science communication research, to assess LLM responses to questions about climate change. Our framework emphasizes both presentational and epistemological adequacy, offering a fine-grained analysis of LLM generations spanning 8 dimensions and 30 issues. Our evaluation task is a real-world example of a growing number of challenging problems where AI can complement and lift human performance. We introduce a novel protocol for scalable oversight that relies on AI Assistance and raters with relevant education. We evaluate several recent LLMs on a set of diverse climate questions. Our results point to a significant gap between surface and epistemological qualities of LLMs in the realm of climate communication.

Paper Structure (46 sections, 15 figures, 35 tables)

This paper contains 46 sections, 15 figures, 35 tables.

Figures (15)

  • Figure 1: Overview of the evaluation pipeline as described in \ref{sec:experiments}. Starting with a question-answer pair, we use an LLM to extract key points from the answer. We also use the LLM to find a relevant Wikipedia page from which we extract paragraphs. For each key point we rank the paragraphs and keep the top ones. We combine all this information to generate AI assistance for each of our evaluation dimensions. Presentational dimensions are evaluated without the additional paragraphs. This assistance, if available, is presented to our human raters along with the answer. Note that the raters for the presentational and epistemological dimensions are not shown the key points or retrieved paragraphs. We do, however, use the key points and paragraphs to evaluate attribution, cf. \ref{sec:ais}.
  • Figure 2: Bootstrapped mean rating, and 95% confidence intervals, for all presentational (left) and epistemological (right) dimensions.
  • Figure 3: Number of issues detected depending on AI Assistance exposure.
  • Figure 4: The relationship between rating and reported helpfulness of the AI assistance (on the same scale).
  • Figure 5: Comparing AIS ratings with average ratings of the $4$ epistemological dimensions.
  • ...and 10 more figures
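
The caption of Figure 2 refers to bootstrapped mean ratings with 95% confidence intervals. As a rough illustration of what such a statistic could look like, the sketch below computes a percentile bootstrap interval for the mean of a list of ratings. The function name `bootstrap_mean_ci`, the number of resamples, the fixed seed, and the choice of the percentile method are all assumptions made for this example; the listing above does not say which bootstrap variant the authors used.

```python
import random
from statistics import mean


def bootstrap_mean_ci(ratings, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap for the mean rating (illustrative sketch only).

    Returns (bootstrapped mean, lower bound, upper bound) for a
    (1 - alpha) confidence interval. All defaults are assumptions.
    """
    if not ratings:
        raise ValueError("ratings must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(ratings)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(ratings, k=n)) for _ in range(n_resamples)
    )
    # Read the interval bounds off the sorted resample means.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return mean(means), means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Hypothetical ratings on a 1-5 scale, not data from the paper.
    example = [1, 2, 3, 4, 5] * 20
    est, lower, upper = bootstrap_mean_ci(example)
    print(f"mean={est:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

Applied per model and per dimension, a routine like this would yield one point estimate and one interval for each bar or marker in a plot of that kind.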