Assessing Large Language Models on Climate Information
Jannis Bulian, Mike S. Schäfer, Afra Amini, Heidi Lam, Massimiliano Ciaramita, Ben Gaiarin, Michelle Chen Hübscher, Christian Buck, Niels G. Mede, Markus Leippold, Nadine Strauß
TL;DR
Assessing climate information produced by LLMs is critical because of misinformation risks and the evolving nature of climate science. The authors present an evaluation framework informed by science communication research and a large human study evaluating responses from multiple LLMs to 300 climate questions, supported by a scalable AI-assisted oversight protocol. They find a persistent gap: surface presentational quality outpaces epistemological adequacy, though AI assistance improves rating fidelity. The work informs safer deployment of LLMs for climate information and identifies concrete directions for strengthening grounding, uncertainty communication, and multimodal evaluation.
Abstract
As Large Language Models (LLMs) rise in popularity, it is necessary to assess their capability in critically relevant domains. We present a comprehensive evaluation framework, grounded in science communication research, to assess LLM responses to questions about climate change. Our framework emphasizes both presentational and epistemological adequacy, offering a fine-grained analysis of LLM generations spanning 8 dimensions and 30 issues. Our evaluation task is a real-world example of a growing number of challenging problems where AI can complement and lift human performance. We introduce a novel protocol for scalable oversight that relies on AI Assistance and raters with relevant education. We evaluate several recent LLMs on a set of diverse climate questions. Our results point to a significant gap between surface and epistemological qualities of LLMs in the realm of climate communication.
