Judging the Judges: Human Validation of Multi-LLM Evaluation for High-Quality K--12 Science Instructional Materials

Peng He; Zhaohui Li; Zeyuan Wang; Jinjun Xiong; Tingting Li

TL;DR

This study examines what human experts notice when reviewing AI-generated evaluations of high-quality instructional materials, aiming to translate their insights into design principles for a future GenAI-based instructional material design agent.

Abstract

Designing high-quality, standards-aligned instructional materials for K--12 science is time-consuming and expertise-intensive. This study examines what human experts notice when reviewing AI-generated evaluations of such materials, aiming to translate their insights into design principles for a future GenAI-based instructional material design agent. We intentionally selected 12 high-quality curriculum units across life, physical, and earth sciences from validated programs such as OpenSciEd and Multiple Literacies in Project-based Learning. Using the EQuIP rubric with 9 evaluation items, we prompted GPT-4o, Claude, and Gemini to produce numerical ratings and written rationales for each unit, generating 648 evaluation outputs. Two science education experts independently reviewed all outputs, marking agreement (1) or disagreement (0) for both scores and rationales, and offering qualitative reflections on AI reasoning. This process surfaces patterns in where LLM judgments align with or diverge from expert perspectives, revealing reasoning strengths, gaps, and contextual nuances. These insights will directly inform the development of a domain-specific GenAI agent to support the design of high-quality instructional materials in K--12 science education.
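The binary agreement coding described in the abstract lends itself to two simple summaries: a per-rater agreement rate and a chance-corrected measure of how consistently the two experts coded the same outputs. The following is a minimal sketch, not the authors' analysis code. The function names and the ten example codes are hypothetical and chosen only for illustration; Cohen's kappa is used because the figure list names it as one of the reported reliability statistics.

```python
from collections import Counter


def agreement_rate(codes):
    """Share of outputs an expert coded as agreement (1) rather than disagreement (0)."""
    if not codes:
        raise ValueError("codes must be non-empty")
    return sum(codes) / len(codes)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who coded the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must code the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of items on which the two raters gave the same code.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes for ten model outputs (1 = expert agrees with the model).
rater_a = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
rater_b = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

rate_a = agreement_rate(rater_a)
rate_b = agreement_rate(rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"Rater A: {rate_a:.0%}, Rater B: {rate_b:.0%}, kappa: {kappa:.3f}")
```

The agreement rate describes how often one expert accepted a model's output, while kappa describes how consistently the two experts coded the same outputs, so the two numbers answer different questions and are best reported side by side.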

Paper Structure (29 sections, 6 figures, 4 tables)

Introduction
Related Work
Evaluating K--12 Science Instructional Materials
Large Language Models for Educational Assessment
Human Validation of AI-Generated Evaluations
Designing AI Agents for Curriculum Development
Methods
Data Collection
Data Analysis
Anticipated Outcomes
Experiments
Experiment setup
Evaluation Metrics
Prompt Design
Results
...and 14 more sections

Figures (6)

Figure 1: System workflow for evaluating learning activities with LLMs.
Figure 2: Example of model output and human agreement for one K--12 lesson activity. Human Agreement is coded as 1 (expert agrees with the model) or 0 (expert disagrees).
Figure 3: Human--LLM agreement rates (%) by topic (top row) and by model (bottom row). Each bar shows Rater A (blue) and Rater B (orange).
Figure 4: Inter-rater reliability (Cohen’s $\kappa$ and Fleiss’ $\kappa$) by topic.
Figure 5: Pairwise exact matches of scores between LLMs.
...and 1 more figure
