Instructional Goal-Aligned Question Generation for Student Evaluation in Virtual Lab Settings: How Closely Do LLMs Actually Align?
R. Alexander Knipper, Indrani Dey, Souvika Sarkar, Hari Narayanan, Sadhana Puntambekar, Santu Karmaker
TL;DR
This work tackles the challenge of aligning LLM-generated questions with instructional goals in interactive virtual labs. It proposes an instructional goal-aligned framework that grounds question generation in a semi-structured simulation representation $S_i = (\text{instruction\_goals}, \text{knowledge\_units}, \text{relationships})$ and combines it with a seven-type question taxonomy and the TELeR prompt-detail taxonomy. An automated evaluation of over 1,100 questions from 19 open-source LLMs shows that larger models improve structural validity (parsability) by 37.1% and average quality by 0.8 points on a 5-point scale, with TELeR Levels 2–3 delivering the best balance. The findings indicate that open-ended formats and relational question types support higher-order thinking and alignment with simulation goals, which can guide practical deployment in K-12 settings.
Abstract
Virtual labs offer valuable opportunities for hands-on, inquiry-based science learning, yet teachers often struggle to adapt them to fit their instructional goals. Third-party materials may not align with classroom needs, and developing custom resources can be time-consuming and difficult to scale. Recent advances in Large Language Models (LLMs) offer a promising avenue for addressing these limitations. In this paper, we introduce a novel alignment framework for instructional goal-aligned question generation, enabling teachers to leverage LLMs to produce simulation-aligned, pedagogically meaningful questions through natural language interaction. The framework integrates four components: instructional goal understanding via teacher-LLM dialogue, lab understanding via knowledge unit and relationship analysis, a question taxonomy for structuring cognitive and pedagogical intent, and the TELeR taxonomy for controlling prompt detail. Early design choices were informed by a small teacher-assisted case study, while our final evaluation analyzed over 1,100 questions from 19 open-source LLMs. The results show that goal and lab understanding ground questions in teacher intent and simulation context, that the question taxonomy elevates cognitive demand (open-ended formats and relational types raise quality by 0.29–0.39 points), and that optimized TELeR prompts enhance format adherence (80% parsability, >90% adherence). Larger models yield the strongest gains: parsability +37.1%, adherence +25.7%, and average quality +0.8 Likert points.
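As an illustration of the semi-structured simulation representation $S_i$ named in the TL;DR, the following Python sketch shows one way the three parts (instructional goals, knowledge units, relationships) could be held together and rendered as prompt context. It is a minimal sketch under stated assumptions, not the authors' implementation: the class names, the `Relationship` fields, the helper methods, and the gas-pressure example are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """A directed link between two knowledge units (hypothetical schema)."""
    source: str
    target: str
    description: str


@dataclass
class SimulationRepresentation:
    """Sketch of S_i = (instruction_goals, knowledge_units, relationships)."""
    instruction_goals: list[str]
    knowledge_units: list[str]
    relationships: list[Relationship] = field(default_factory=list)

    def dangling_units(self) -> set[str]:
        """Return units named in relationships but missing from knowledge_units."""
        known = set(self.knowledge_units)
        used = {r.source for r in self.relationships}
        used |= {r.target for r in self.relationships}
        return used - known

    def to_prompt_context(self) -> str:
        """Render the representation as plain text for inclusion in a prompt."""
        lines = ["Instructional goals:"]
        lines += [f"- {g}" for g in self.instruction_goals]
        lines.append("Knowledge units:")
        lines += [f"- {u}" for u in self.knowledge_units]
        lines.append("Relationships:")
        lines += [
            f"- {r.source} -> {r.target}: {r.description}"
            for r in self.relationships
        ]
        return "\n".join(lines)


# Illustrative values only; they are not taken from the paper's simulations.
example = SimulationRepresentation(
    instruction_goals=["Explain how temperature affects gas pressure"],
    knowledge_units=["temperature", "pressure"],
    relationships=[
        Relationship("temperature", "pressure",
                     "raising one raises the other at fixed volume"),
    ],
)
```

Calling `example.to_prompt_context()` yields a plain-text block that could be placed ahead of a question-generation instruction, and `example.dangling_units()` offers a quick consistency check that every relationship refers to a declared knowledge unit.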
