Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning

Xinyi Wu, Yanhao Jia, Qinglin Zhang, Yiran Qin, Luwei Xiao, Shuai Zhao

TL;DR

This work tackles the problem of robustly evaluating multimodal STEM PBL outcomes by introducing PBLBench and the PBL-STEM dataset, which consist of long-context, cross-modal student projects. It grounds the evaluation criteria in Analytic Hierarchy Process (AHP) weights derived from expert judgments and benchmarks 15 MLLMs/LLMs, revealing that even top models achieve only about $59\%$ ranking accuracy and exhibit hallucinations and instability. The key contributions are the first multimodal STEM PBL dataset, an expert-validated, weighted rubric for PBL assessment, and a comprehensive benchmark that probes long-context reasoning and cross-modal integration. The work highlights the potential for AI-assisted education while underscoring current limitations and the need for robust self-verification mechanisms to reduce teacher workload and improve assessment reliability. The evaluation framework centers on $\{(S_i; R_i)\}_{i=0}^n = M(P, x_i)$ for all $x_i \in \text{PBL-STEM}$, which operationalizes model scores and rankings against expert standards.

Abstract

Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short in providing both a free-form output structure and a rigorous human expert validation process, limiting their effectiveness in evaluating real-world educational tasks. Additionally, few methods use MLLMs in automated pipelines that assist teachers with their complex responsibilities, largely because model hallucination and instability make such pipelines unreliable. To address this gap, we introduce PBLBench, a novel benchmark designed to evaluate complex reasoning grounded in domain-specific knowledge and long-context understanding, thereby challenging models with tasks that closely resemble those handled by human experts. To establish reliable ground truth, we adopt the Analytic Hierarchy Process (AHP), utilizing expert-driven pairwise comparisons to derive structured and weighted evaluation criteria. We assess the performance of 15 leading MLLMs/LLMs using PBLBench and demonstrate that even the most advanced models achieve only 59% rank accuracy, underscoring the significant challenges presented by this benchmark. We believe PBLBench will serve as a catalyst for the development of more capable AI agents, ultimately aiming to alleviate teacher workload and enhance educational productivity.
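
To make the AHP step more concrete, the following is a minimal sketch of how pairwise comparisons can be turned into criterion weights. It is not the authors' implementation: the three criteria, the comparison values, and the function names are hypothetical, and the row geometric-mean method is used here as a common approximation of AHP's principal-eigenvector weights. The consistency check uses Saaty's standard random-index values.

```python
import math

# Hypothetical criteria and expert pairwise judgments (not from the paper).
# PAIRWISE[i][j] states how much more important criterion i is than criterion j.
CRITERIA = ["scientific rigor", "creativity", "presentation"]
PAIRWISE = [
    [1.0, 3.0, 5.0],
    [1.0 / 3.0, 1.0, 2.0],
    [1.0 / 5.0, 1.0 / 2.0, 1.0],
]

# Saaty's random consistency index for small matrix sizes.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}


def ahp_weights(matrix):
    """Approximate AHP priority weights with the row geometric-mean method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(matrix, weights):
    """Estimate the consistency ratio; values below 0.1 are usually accepted."""
    n = len(matrix)
    # Estimate lambda_max as the mean of (A w)_i / w_i over all rows.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]


def weighted_score(weights, scores):
    """Combine per-criterion scores into a single weighted project score."""
    return sum(w * s for w, s in zip(weights, scores))


weights = ahp_weights(PAIRWISE)
cr = consistency_ratio(PAIRWISE, weights)
for name, w in zip(CRITERIA, weights):
    print(f"{name}: {w:.3f}")
print(f"consistency ratio: {cr:.4f}")
```

With weights like these, a project's per-criterion scores can be aggregated through `weighted_score`, and projects can then be ranked by the aggregate. How PBLBench itself aggregates scores and computes rank accuracy is not specified in this summary, so that part is left out of the sketch.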

Paper Structure

This paper contains 13 sections, 1 equation, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Schematic illustration of the PBL reviewing challenge with different representations.
  • Figure 2: The pipeline of PBLBench includes the construction of evaluation criteria, scoring by human experts, and model scoring processes.
  • Figure 3: The results compare the performance of different languages and different modalities.
  • Figure 4: The results compare the performance of different model sizes and various materials.
  • Figure 5: Performance comparison of the type of video and the different input lengths.
  • ...and 7 more figures