Reasoning With a Star: A Heliophysics Dataset and Benchmark for Agentic Scientific Reasoning

Kevin Lee, Russell Spiewak, James Walsh

TL;DR

Reasoning With a Star (RWS) presents a domain-grounded heliophysics reasoning benchmark derived from NASA/UCAR Living With a Star (LWS) problem sets and a STAR-inspired multi-agent framework to evaluate scientific reasoning in LLMs. It introduces a programmatic grader enforcing unit-consistent outputs, symbolic equivalence, and schema validity, and compares four agentic patterns (HMAW, PACE, PHASE, SCHEMA) against single-shot baselines across multiple datasets. The study finds no universal best pattern: compact, plan-oriented pipelines excel on arithmetic tasks, while structured coordination improves methodological formulation and verification in heliophysics problems, with SCHEMA particularly effective on format- and verification-heavy tasks. Together, the dataset, grader, and multi-agent comparisons offer a path toward auditable, domain-specific reasoning for space-science AI systems and motivate expanding RWS with more problem sets and failure annotations.

Abstract

Scientific reasoning with Large Language Models in heliophysics involves more than recalling facts: it requires incorporating physical assumptions, maintaining consistent units, and presenting results in clear scientific formats through coordinated approaches. To address these challenges, we present Reasoning With a Star, a new heliophysics dataset for evaluating reasoning, together with an initial benchmarking approach. The data are constructed from National Aeronautics and Space Administration (NASA) and University Corporation for Atmospheric Research (UCAR) Living With a Star summer school problem sets and compiled into a readily consumable question-and-answer structure with question contexts, reasoning steps, expected answer type, ground-truth targets, format hints, and metadata. A programmatic grader checks predictions using unit-aware numerical tolerance, symbolic equivalence, and schema validation. We benchmark a single-shot baseline and four multi-agent patterns, finding that decomposing workflows through systems engineering principles outperforms direct prompting on problems requiring deductive reasoning rather than pure inductive recall.
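
To make two of the grader checks named above concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical record layout whose field names mirror the components listed in the abstract, a small hand-written unit table, and a 5% relative tolerance; none of these specifics come from the paper. Symbolic equivalence is left out because it would normally rely on a computer algebra system.

```python
import math

# Hypothetical field names mirroring the record components listed in the abstract.
REQUIRED_FIELDS = {
    "question", "reasoning_steps", "answer_type",
    "target", "format_hint", "metadata",
}

# Illustrative unit table (assumption): unit -> (physical dimension, factor to SI).
UNITS = {
    "m": ("length", 1.0),
    "km": ("length", 1.0e3),
    "cm": ("length", 1.0e-2),
    "T": ("magnetic_field", 1.0),
    "nT": ("magnetic_field", 1.0e-9),
    "G": ("magnetic_field", 1.0e-4),
    "s": ("time", 1.0),
}


def missing_fields(record):
    """Schema check: return the required fields absent from a record, sorted."""
    return sorted(REQUIRED_FIELDS - set(record))


def grade_numeric(pred_value, pred_unit, true_value, true_unit, rel_tol=0.05):
    """Unit-aware check: convert both values to SI, then compare within rel_tol."""
    if pred_unit not in UNITS or true_unit not in UNITS:
        return False  # unknown unit: cannot be graded
    pred_dim, pred_factor = UNITS[pred_unit]
    true_dim, true_factor = UNITS[true_unit]
    if pred_dim != true_dim:
        return False  # dimension mismatch, e.g. a length given for a time
    return math.isclose(
        pred_value * pred_factor, true_value * true_factor, rel_tol=rel_tol
    )
```

Under these assumptions, `grade_numeric(6.96e5, "km", 6.96e8, "m")` accepts a solar radius reported in kilometres against a target in metres, while a prediction in seconds against a target in kilometres is rejected on dimension alone, before any numerical comparison.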

Paper Structure

This paper contains 25 sections, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Example symbolic item from the Reasoning With a Star (RWS) dataset, drawn from 2010_Lee_hw.pdf, showing a problem, reasoning steps, and a LaTeX final expression.
  • Figure 2: General Design Philosophy of Multi-Agent System.
  • Figure 3: HMAW: Agentic Workflow Diagram.
  • Figure 4: HMAW: UML Activity Diagram (Google ADK).
  • Figure 5: HMAW: UML Sequence Diagram.
  • ...and 10 more figures