ISD-Agent-Bench: A Comprehensive Benchmark for Evaluating LLM-based Instructional Design Agents

YoungHoon Jeon, Suwan Kim, Haein Son, Sookbun Lee, Yeil Jeong, Unggi Lee

TL;DR

ISD-Agent-Bench introduces a contextually rich benchmark for evaluating LLM-based Instructional Systems Design (ISD) agents. It combines a Context Matrix of 51 contextual variables across 5 categories with 33 ADDIE-derived sub-steps to generate $25{,}795$ scenarios and $1{,}202$ test cases. Assessment uses a two-stage evaluation that combines rubric-based scoring with a weighted overall score, $\text{Score} = 0.7 \times \text{ADDIE} + 0.3 \times \text{Trajectory}$. The study presents four theory-based agents (React-ADDIE, ADDIE-Agent, Dick-Carey-Agent, RPISD-Agent) and a Baseline, and finds that React-ADDIE achieves the highest mean performance, $86.49$, outperforming both pure theory-based and pipeline approaches. Across three LLMs, the results indicate that integrating classical ISD theory with ReAct-style reasoning yields the strongest outputs, and that theoretical quality strongly predicts benchmark performance ($r = 0.656$, $p < 0.001$), supporting the benchmark as a foundation for systematic LLM-based ISD research.
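
As a rough illustration of the weighted score above, the following sketch blends an ADDIE rubric score with a trajectory score using the 0.7 and 0.3 weights from the summary. The function name `combined_score`, the assumption that both inputs share a 0 to 100 scale, and the sample values are hypothetical and not taken from the paper.

```python
def combined_score(addie: float, trajectory: float,
                   w_addie: float = 0.7, w_trajectory: float = 0.3) -> float:
    """Weighted blend of an ADDIE rubric score and a trajectory score.

    Assumes (hypothetically) that both inputs are on the same scale,
    e.g. 0 to 100, and that the two weights sum to 1.
    """
    if abs(w_addie + w_trajectory - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w_addie * addie + w_trajectory * trajectory


# Made-up inputs for illustration only; the result is approximately 87.
example = combined_score(90.0, 80.0)
print(round(example, 2))
```

Because the ADDIE term carries 70% of the weight, a change in the rubric score moves the total a little more than twice as much as the same change in the trajectory score.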

Abstract

Large Language Model (LLM) agents have shown promising potential in automating Instructional Systems Design (ISD), a systematic approach to developing educational programs. However, evaluating these agents remains challenging due to the lack of standardized benchmarks and the risk of LLM-as-judge bias. We present ISD-Agent-Bench, a comprehensive benchmark comprising 25,795 scenarios generated via a Context Matrix framework that combines 51 contextual variables across 5 categories with 33 ISD sub-steps derived from the ADDIE model. To ensure evaluation reliability, we employ a multi-judge protocol using diverse LLMs from different providers, achieving high inter-judge reliability. We compare existing ISD agents with novel agents grounded in classical ISD theories such as ADDIE, Dick & Carey, and Rapid Prototyping ISD. Experiments on 1,017 test scenarios demonstrate that integrating classical ISD frameworks with modern ReAct-style reasoning achieves the highest performance, outperforming both pure theory-based agents and technique-only approaches. Further analysis reveals that theoretical quality strongly correlates with benchmark performance, with theory-based agents showing significant advantages in problem-centered design and objective-assessment alignment. Our work provides a foundation for systematic LLM-based ISD research.

Paper Structure

This paper contains 64 sections, 8 equations, 3 figures, and 15 tables.

Figures (3)

  • Figure 1: Overview of ISD-Agent-Bench. The top panel shows the Context Matrix framework, which combines 51 contextual variables across 5 categories (Learner Characteristics, Institution, Domain, Delivery, Constraints) with the ISD Matrix of 33 ADDIE-based sub-steps for systematic scenario generation. The bottom-left panel shows the data generation pipeline, from abstract collection through context variation to the final dataset. The bottom-right panel shows the ISD theory-based agent architecture, which integrates classical ISD theories with ReAct-style reasoning to produce comprehensive ISD outputs.
  • Figure 2: Theoretical quality analysis. Left panel shows correlation between theoretical quality scores and benchmark performance ($r = 0.656$, $p < 0.001$). Middle panel compares theory-based and non-theory agents across all 9 dimensions, where asterisks indicate significant differences ($d > 0.2$). Right panel displays inter-dimension correlation matrix revealing cross-framework relationships.
  • Figure 3: Agent comparison and framework validation. Left panel presents agent-wise MPI and DC scores with theory-based agents marked, where React-ADDIE leads both dimensions. Middle panel shows theoretical quality by difficulty level, demonstrating both agent types maintain stable performance. Right panel illustrates MPI-DC correlation ($r = 0.892$), confirming the frameworks capture related but complementary quality aspects.
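
The figure captions report a correlation coefficient $r$ and an effect-size threshold $d > 0.2$. As a minimal sketch of how such statistics are commonly computed, the code below implements the Pearson correlation and Cohen's $d$ with a pooled standard deviation. Reading $r$ as Pearson's $r$ and $d$ as Cohen's $d$ is an assumption, since the captions do not name the statistics; the function names and sample numbers are hypothetical and are not data from the paper.

```python
from math import sqrt
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Made-up scores for illustration only.
theory_quality = [3.1, 3.8, 4.2, 4.6, 4.9]
benchmark = [78.0, 82.5, 84.0, 88.0, 90.5]
print(pearson_r(theory_quality, benchmark))

theory_based = [2.0, 4.0, 6.0]
non_theory = [1.0, 3.0, 5.0]
print(cohens_d(theory_based, non_theory))
```

Under this reading, a dimension would be marked with an asterisk in the middle panel of Figure 2 when `cohens_d` for the theory-based versus non-theory groups exceeds 0.2.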