ISD-Agent-Bench: A Comprehensive Benchmark for Evaluating LLM-based Instructional Design Agents
YoungHoon Jeon, Suwan Kim, Haein Son, Sookbun Lee, Yeil Jeong, Unggi Lee
TL;DR
ISD-Agent-Bench is a contextually rich benchmark for evaluating LLM-based Instructional Systems Design (ISD) agents. It combines a Context Matrix of 51 contextual variables across 5 categories with 33 ADDIE-derived sub-steps to generate $25{,}795$ scenarios and $1{,}202$ test cases. Agents are assessed holistically through a two-stage evaluation that blends rubric-based scoring with the weighted score $\text{Score} = 0.7 \times \text{ADDIE} + 0.3 \times \text{Trajectory}$. The study presents four theory-based agents (React-ADDIE, ADDIE-Agent, Dick-Carey-Agent, RPISD-Agent) and a Baseline, and finds that React-ADDIE achieves the highest mean performance, $86.49$, outperforming both pure theory-based and pipeline approaches. Across three LLMs, the results indicate that integrating classical ISD theory with ReAct-style reasoning yields superior outputs and that theoretical quality strongly predicts benchmark performance ($r = 0.656$, $p < 0.001$), supporting the benchmark as a foundation for systematic LLM-based ISD research.
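As a rough illustration of the weighted score quoted above, the sketch below blends an ADDIE component with a Trajectory component using the 0.7 and 0.3 weights. The function name `composite_score`, the assumption that both components share a 0 to 100 scale, and the example inputs are all hypothetical and not taken from the paper; how each component is derived from the rubric is not covered here.

```python
def composite_score(
    addie: float,
    trajectory: float,
    w_addie: float = 0.7,
    w_trajectory: float = 0.3,
) -> float:
    """Blend two component scores with the weights from the TL;DR formula.

    Assumption: both components are on the same scale (for example 0 to 100).
    """
    # Guard against weights that would silently rescale the result.
    if abs(w_addie + w_trajectory - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w_addie * addie + w_trajectory * trajectory


# Hypothetical component scores, chosen only for illustration.
example = composite_score(addie=90.0, trajectory=80.0)
print(f"{example:.2f}")
```

Because the ADDIE component carries 70% of the weight, a one-point change in it moves the composite more than twice as far as the same change in the Trajectory component.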
Abstract
Large Language Model (LLM) agents have shown promise in automating Instructional Systems Design (ISD), a systematic approach to developing educational programs. However, evaluating these agents remains challenging because standardized benchmarks are lacking and LLM-as-judge evaluation risks bias. We present ISD-Agent-Bench, a comprehensive benchmark comprising 25,795 scenarios generated via a Context Matrix framework that combines 51 contextual variables across 5 categories with 33 ISD sub-steps derived from the ADDIE model. To ensure evaluation reliability, we employ a multi-judge protocol using diverse LLMs from different providers, achieving high inter-judge reliability. We compare existing ISD agents with novel agents grounded in classical ISD theories such as ADDIE, Dick & Carey, and Rapid Prototyping ISD. Experiments on 1,017 test scenarios demonstrate that integrating classical ISD frameworks with modern ReAct-style reasoning achieves the highest performance, outperforming both pure theory-based agents and technique-only approaches. Further analysis reveals that theoretical quality strongly correlates with benchmark performance, with theory-based agents showing significant advantages in problem-centered design and objective-assessment alignment. Our work provides a foundation for systematic LLM-based ISD research.
