Instructional Goal-Aligned Question Generation for Student Evaluation in Virtual Lab Settings: How Closely Do LLMs Actually Align?

R. Alexander Knipper; Indrani Dey; Souvika Sarkar; Hari Narayanan; Sadhana Puntambekar; Santu Karmaker

TL;DR

This work tackles the challenge of aligning LLM-generated questions with instructional goals in interactive virtual labs. It proposes an instructional goal-aligned framework that grounds question generation in a semi-structured simulation representation $S_i = (\text{instructional goals}, \text{knowledge units}, \text{relationships})$, combined with a seven-type question taxonomy and the TELeR prompt-detail taxonomy. An automated evaluation of over 1,100 questions from 19 open-source LLMs shows that larger models improve structural validity (parsability) by 37.1% and average quality by 0.8 points on a 5-point scale, with TELeR Levels 2–3 delivering the best balance. The findings indicate that open-ended formats and relational question types support higher-order thinking and alignment with simulation goals, which can guide practical deployment in K-12 settings.

Abstract

Virtual labs offer valuable opportunities for hands-on, inquiry-based science learning, yet teachers often struggle to adapt them to fit their instructional goals. Third-party materials may not align with classroom needs, and developing custom resources can be time-consuming and difficult to scale. Recent advances in Large Language Models (LLMs) offer a promising avenue for addressing these limitations. In this paper, we introduce a novel framework for instructional goal-aligned question generation, enabling teachers to leverage LLMs to produce simulation-aligned, pedagogically meaningful questions through natural language interaction. The framework integrates four components: instructional goal understanding via teacher-LLM dialogue, lab understanding via knowledge unit and relationship analysis, a question taxonomy for structuring cognitive and pedagogical intent, and the TELeR taxonomy for controlling prompt detail. Early design choices were informed by a small teacher-assisted case study, while our final evaluation analyzed over 1,100 questions from 19 open-source LLMs. Goal and lab understanding ground questions in teacher intent and simulation context, the question taxonomy elevates cognitive demand (open-ended formats and relational types raise quality by 0.29–0.39 points), and optimized TELeR prompts enhance format adherence (80% parsability, >90% adherence). Larger models yield the strongest gains: parsability +37.1%, adherence +25.7%, and average quality +0.8 Likert points.
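To make the grounding idea concrete, the following is a minimal sketch of how a simulation described by instructional goals, knowledge units, and relationships might be held in code and turned into a prompt. It is not the authors' implementation: the class `SimulationRepresentation`, the function `build_prompt`, and the way `detail_level` gates context are all hypothetical, and the level thresholds are illustrative rather than the TELeR taxonomy's actual level definitions.

```python
from dataclasses import dataclass, field


@dataclass
class SimulationRepresentation:
    """Hypothetical container for one simulation: goals, units, relationships."""

    instructional_goals: list[str]
    knowledge_units: list[str]
    # Each relationship is (source unit, target unit, short description).
    relationships: list[tuple[str, str, str]] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if a relationship names an undeclared knowledge unit."""
        known = set(self.knowledge_units)
        for source, target, _ in self.relationships:
            if source not in known or target not in known:
                raise ValueError(f"unknown knowledge unit in ({source}, {target})")


def build_prompt(sim: SimulationRepresentation, question_type: str, detail_level: int) -> str:
    """Assemble a prompt; higher detail_level adds more simulation context.

    The thresholds below are an assumption made for illustration only.
    """
    sim.validate()
    lines = [f"Write one {question_type} question for students using this virtual lab."]
    if detail_level >= 1:
        lines.append("Instructional goals: " + "; ".join(sim.instructional_goals))
    if detail_level >= 2:
        lines.append("Knowledge units: " + ", ".join(sim.knowledge_units))
    if detail_level >= 3:
        for source, target, description in sim.relationships:
            lines.append(f"Relationship: {source} -> {target} ({description})")
    return "\n".join(lines)
```

Keeping the representation separate from prompt assembly means the same simulation description can be reused across question types and prompt-detail levels, which is the kind of comparison the evaluation above reports.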
