CLINB: A Climate Intelligence Benchmark for Foundational Models

Michelle Chen Huebscher, Katharine Mach, Aleksandar Stanić, Markus Leippold, Ben Gaiarin, Zeke Hausfather, Elisa Rawat, Erich Fischer, Massimiliano Ciaramita, Joeri Rogelj, Christian Buck, Lierni Sestorain Saralegui, Reto Knutti

TL;DR

The paper presents CLINB, a Climate Intelligence Benchmark designed to evaluate open-ended, multimodal climate knowledge with rigorous evidential grounding. It fuses real user questions, expert-curated rubrics, and a model-based autorater to enable scalable, expert-aligned assessment of frontier models. Key findings show frontier models achieve PhD-level synthesis and presentation but frequently hallucinate citations and images, exposing a gap between knowledge generation and verifiable attribution. The work demonstrates that a robust, rubric-driven, human-in-the-loop framework can improve evaluation fidelity, while also revealing the need for adaptive rubrics and improved evidence handling to support trustworthy AI in scientific workflows. Together, these results motivate further development of interpretable benchmarks and collaborative interfaces that bridge AI capabilities with rigorous scientific validation.

Abstract

Evaluating how Large Language Models (LLMs) handle complex, specialized knowledge remains a critical challenge. We address this through the lens of climate change by introducing CLINB, a benchmark that assesses models on open-ended, grounded, multimodal question answering tasks with clear requirements for knowledge quality and evidential support. CLINB relies on a dataset of real users' questions and evaluation rubrics curated by leading climate scientists. We implement and validate a model-based evaluation process and evaluate several frontier models. Our findings reveal a critical dichotomy. Frontier models demonstrate remarkable knowledge synthesis capabilities, often exhibiting PhD-level understanding and presentation quality. They outperform "hybrid" answers curated by domain experts assisted by weaker models. However, this performance is countered by failures in grounding. The quality of evidence varies, with substantial hallucination rates for references and images. We argue that bridging this gap between knowledge synthesis and verifiable attribution is essential for the deployment of AI in scientific workflows and that reliable, interpretable benchmarks like CLINB are needed to progress towards building trustworthy AI systems.

Paper Structure

This paper contains 67 sections, 12 figures, and 7 tables.

Figures (12)

  • Figure 1: The multi-step, human-in-the-loop process to construct the CLINB dataset.
  • Figure 2: Left: Hybrid answers have more references than LLM answers. Right: Example of a pairwise preference graph for a single question.
  • Figure 3: Number of reference URLs (top) and image URLs (bottom), and their status.
  • Figure 4: Distribution of topics and difficulty level by working group.
  • Figure 5: Distributions of counts for references (a) and images (b) in hybrid and LLM answers.
  • ...and 7 more figures