AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science

Chenyue Li, Wen Deng, Mengqian Lu, Binhang Yuan

TL;DR

AtmosSci-Bench addresses the need for a rigorous, domain-specific benchmark to evaluate large language models in atmospheric science. It combines MCQ and open-ended question formats across five core domains, employing a template-based MCQ generation pipeline with symbolic perturbations and a cascaded OEQ evaluation framework to probe deep reasoning. The study compares instruction-tuned, reasoning-optimized, math-augmented, and domain-specific climate models, revealing that reasoning-centered models substantially outperform others, while domain-focused models often underperform due to weaker stepwise reasoning. The benchmark demonstrates meaningful differentiation, analyzes inference-time scaling and robustness to symbolic perturbations, and provides open-source code and data to foster reproducible, climate-service–oriented AI research.

Abstract

The rapid advancements in large language models (LLMs), particularly in their reasoning capabilities, hold transformative potential for addressing complex challenges and boosting scientific discovery in atmospheric science. However, leveraging LLMs effectively in this domain requires a robust and comprehensive evaluation benchmark. Toward this end, we present AtmosSci-Bench, a novel benchmark designed to systematically assess LLM performance across five core categories of atmospheric science problems: hydrology, atmospheric dynamics, atmospheric physics, geophysics, and physical oceanography. AtmosSci-Bench features a dual-format design comprising both multiple-choice questions (MCQs) and open-ended questions (OEQs), enabling scalable automated evaluation alongside deeper analysis of conceptual understanding. We employ a template-based MCQ generation framework to create diverse, graduate-level problems with symbolic perturbation, while OEQs are used to probe open-ended reasoning. We conduct a comprehensive evaluation of representative LLMs, categorized into four groups: instruction-tuned models, advanced reasoning models, math-augmented models, and domain-specific climate models. Our analysis provides some interesting insights into the reasoning and problem-solving capabilities of LLMs in atmospheric science. We believe AtmosSci-Bench can serve as a critical step toward advancing LLM applications in climate services by offering a standard and rigorous evaluation framework. Our source code is available at https://github.com/Relaxed-System-Lab/AtmosSci-Bench.

Paper Structure

This paper contains 61 sections, 14 equations, 4 figures, 13 tables.

Figures (4)

  • Figure 1: Overview of AtmosSci-Bench
  • Figure 2: Construction pipeline of our template-based question generation framework. Red blocks display the question collection process. Blue blocks represent the question generation process (variables are highlighted in different colors). Green blocks depict the automatic problem solver, which derives the answer from the given variables. Yellow blocks illustrate an example of a generated question and its corresponding options.
  • Figure 3: Reasoning step study. Accuracy (%) of different models across increasing input lengths.
  • Figure 4: Performance distribution among reasoning LLMs on MCQ30. The Y-axis represents the frequency of the symbolic test sets achieving the accuracy shown on the X-axis. The black vertical dashed lines denote the accuracy of the original question set.
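
The template-based generation pipeline described in the Figure 2 caption (a question template with variables, an automatic solver, and generated options) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the AtmosSci-Bench code: the ideal-gas-law template, the variable ranges, the distractor rules, and the names `TEMPLATE`, `solve`, `generate`, and `accuracy` are all hypothetical.

```python
import random
from dataclasses import dataclass

R_DRY_AIR = 287.0  # specific gas constant for dry air, J kg^-1 K^-1

# Hypothetical template; {p_hpa} and {t_k} are the perturbable variables.
TEMPLATE = (
    "Dry air has a pressure of {p_hpa} hPa and a temperature of {t_k} K. "
    "What is its density in kg/m^3?"
)


@dataclass
class MCQ:
    question: str
    options: list[str]
    answer_index: int


def solve(p_hpa: float, t_k: float) -> float:
    """Automatic solver: ideal gas law, rho = p / (R_d * T), with p in Pa."""
    return p_hpa * 100.0 / (R_DRY_AIR * t_k)


def generate(seed: int) -> MCQ:
    """Instantiate the template with seeded random values (one perturbation)."""
    rng = random.Random(seed)
    p_hpa = rng.randint(500, 1000)  # assumed range, hPa
    t_k = rng.randint(230, 300)     # assumed range, K
    correct = solve(p_hpa, t_k)
    # Illustrative distractors that mimic plausible slips: skipping the
    # hPa -> Pa conversion, using the water-vapour gas constant (461.5),
    # or using the universal gas constant (8.314).
    values = [
        correct,
        correct / 100.0,
        correct * R_DRY_AIR / 461.5,
        correct * R_DRY_AIR / 8.314,
    ]
    options = [f"{v:.3g}" for v in values]
    rng.shuffle(options)
    question = TEMPLATE.format(p_hpa=p_hpa, t_k=t_k)
    return MCQ(question, options, options.index(f"{correct:.3g}"))


def accuracy(model, questions: list[MCQ]) -> float:
    """Fraction of questions for which model(q) returns the correct index."""
    return sum(model(q) == q.answer_index for q in questions) / len(questions)
```

Calling `[generate(seed) for seed in range(30)]` would yield one perturbed test set of 30 questions; repeating this with different seed ranges and scoring each set with `accuracy` gives the kind of per-set accuracy distribution that Figure 4 plots, under the assumptions above.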