Can Multimodal LLMs See Science Instruction? Benchmarking Pedagogical Reasoning in K-12 Classroom Videos

Yixuan Shen, Peng He, Honglu Liu, Yuyang Ji, Tingting Li, Tianlong Chen, Kaidi Xu, Feng Liu

TL;DR

SciIBI is the first video benchmark for analyzing science classroom discourse, featuring 113 NGSS-aligned clips annotated with Core Instructional Practices and sophistication levels, and reveals fundamental limitations: current models struggle to distinguish pedagogically similar practices, suggesting that CIP coding requires instructional reasoning beyond surface pattern matching.

Abstract

K-12 science classrooms are rich sites of inquiry where students coordinate phenomena, evidence, and explanatory models through discourse; yet, the multimodal complexity of these interactions has made automated analysis elusive. Existing benchmarks for classroom discourse focus primarily on mathematics and rely solely on transcripts, overlooking the visual artifacts and model-based reasoning emphasized by the Next Generation Science Standards (NGSS). We address this gap with SciIBI, the first video benchmark for analyzing science classroom discourse, featuring 113 NGSS-aligned clips annotated with Core Instructional Practices (CIP) and sophistication levels. By evaluating eight state-of-the-art LLMs and Multimodal LLMs, we reveal fundamental limitations: current models struggle to distinguish pedagogically similar practices, suggesting that CIP coding requires instructional reasoning beyond surface pattern matching. Furthermore, adding video input yields inconsistent gains across architectures. Crucially, our evidence-based evaluation reveals that models often succeed through surface shortcuts rather than genuine pedagogical understanding. These findings establish science classroom discourse as a challenging frontier for multimodal AI and point toward human-AI collaboration, where models retrieve evidence to accelerate expert review rather than replace it.

Paper Structure (40 sections, 3 figures, 5 tables)

This paper contains 40 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The impact of visual context on instructional practice coding. Left: Text-only models fail to detect student engagement (e.g., raising hands) from the transcript alone, misclassifying the clip as a lecture (Big Idea). Right: Multimodal models (Vision+Text) utilize visual cues to correctly identify "Eliciting Student Ideas" (D1), yielding an average accuracy improvement of 4.8% across evaluated MLLMs.
  • Figure 2: Overview of the SciIBI benchmark construction. The pipeline involves sourcing NGSS-aligned videos, temporal segmentation based on instructional activities, and consensus-based expert annotation. The final dataset features 113 clips with a naturalistic distribution across four Core Instructional Practices (CIP) and binary sophistication levels (Low vs. High).
  • Figure 3: Failure analysis of text-only models. (a) Aggregated confusion matrix reveals systematic confusion between eliciting ideas (D1) and pressing for explanation (D3). (b) Representative errors illustrate how models often rely on surface keywords (e.g., "asking for prediction") rather than the underlying pedagogical function, leading to misinterpretations of instructional intent.
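
The aggregated confusion matrix described in Figure 3(a) can be illustrated with a minimal sketch. It assumes that each model's predictions are summed into one gold-by-predicted count matrix, which is then scanned for the pair of practices with the most mutual errors. The function names, the label strings, and the toy predictions below are all hypothetical and serve only as an illustration; they are not the authors' code or results.

```python
from collections import Counter


def aggregate_confusion(runs, labels):
    """Sum per-model confusion counts into one matrix.

    runs: iterable of (gold, predicted) label-list pairs, one pair per model.
    labels: ordered practice codes; rows are gold labels, columns are predictions.
    """
    counts = Counter()
    for gold, pred in runs:
        if len(gold) != len(pred):
            raise ValueError("gold and predicted lists must be the same length")
        counts.update(zip(gold, pred))
    return [[counts[(g, p)] for p in labels] for g in labels]


def most_confused_pair(matrix, labels):
    """Return the pair of distinct labels with the most errors in both directions."""
    best, best_count = None, -1
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            mutual = matrix[i][j] + matrix[j][i]
            if mutual > best_count:
                best, best_count = (labels[i], labels[j]), mutual
    return best, best_count


# Toy data invented for illustration only (not results from the paper).
labels = ["Big Idea", "D1", "D3"]
gold = ["D1", "D1", "D3", "D3", "Big Idea"]
model_a = ["D1", "D3", "D3", "D1", "Big Idea"]   # hypothetical model output
model_b = ["Big Idea", "D1", "D3", "D3", "Big Idea"]  # hypothetical model output

matrix = aggregate_confusion([(gold, model_a), (gold, model_b)], labels)
pair, errors = most_confused_pair(matrix, labels)
```

Summing raw counts rather than averaging per-model rates weights every clip-level decision equally, which is one plausible reading of "aggregated"; the paper may normalize differently.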