SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System

Zhiyu Xu, Weilong Yan, Yufei Shi, Xin Meng, Tao He, Huiping Zhuang, Ming Li, Hehe Fan

TL;DR

SciEducator is an iterative, Deming Cycle–driven multi-agent system for scientific video understanding and education, a domain that requires external knowledge integration and rigorous step-wise reasoning. A planner and an evaluator orchestrate a plan-do-evaluate loop, drawing on internal and external knowledge sources to produce accurate interpretations and multimodal educational content, including child-friendly e-booklets. The system is evaluated on SciVBench, a benchmark of 500 expert-verified QA pairs covering physical, chemical, and everyday phenomena, which measures both understanding and educational capabilities. Experimental results show that SciEducator outperforms leading closed-source MLLMs and prior video agent systems in both understanding and education, which the authors present as a new paradigm for domain-specific video understanding and pedagogy.

Abstract

Recent advancements in multimodal large language models (MLLMs) and video agent systems have significantly improved general video understanding. However, when applied to scientific video understanding and educating, a domain that demands external professional knowledge integration and rigorous step-wise reasoning, existing approaches often struggle. To bridge this gap, we propose SciEducator, the first iterative self-evolving multi-agent system for scientific video comprehension and education. Rooted in the classical Deming Cycle from management science, our design reformulates its Plan-Do-Study-Act philosophy into a self-evolving reasoning and feedback mechanism, which facilitates the interpretation of intricate scientific activities in videos. Moreover, SciEducator can produce multimodal educational content tailored to specific scientific processes, including textual instructions, visual guides, audio narrations, and interactive references. To support evaluation, we construct SciVBench, a benchmark consisting of 500 expert-verified and literature-grounded science QA pairs across five categories, covering physical, chemical, and everyday phenomena. Extensive experiments demonstrate that SciEducator substantially outperforms leading closed-source MLLMs (e.g., Gemini, GPT-4o) and state-of-the-art video agents on the benchmark, establishing a new paradigm for the community.
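The Plan-Do-Study-Act feedback mechanism described in the abstract can be illustrated with a minimal control-loop sketch. This is not the authors' implementation: the names `CycleState`, `pdsa_loop`, and the `toy_*` callbacks are hypothetical, and the stub agents stand in for whatever planner, tools, and evaluator the real system uses. It only shows how evaluator feedback from one round could steer the plan of the next.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CycleState:
    """Hypothetical container for what is carried between PDSA rounds."""
    question: str
    plan: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    answer: str = ""
    feedback: str = ""


def pdsa_loop(
    question: str,
    plan_fn: Callable[[CycleState], List[str]],
    do_fn: Callable[[CycleState], Tuple[List[str], str]],
    study_fn: Callable[[CycleState], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[CycleState, int]:
    """Run Plan -> Do -> Study, feeding the critique back (Act) until accepted."""
    state = CycleState(question=question)
    for round_idx in range(1, max_rounds + 1):
        state.plan = plan_fn(state)                   # Plan: decide the steps
        state.evidence, state.answer = do_fn(state)   # Do: execute the steps
        accepted, state.feedback = study_fn(state)    # Study: evaluate the draft
        if accepted:
            return state, round_idx
        # Act: the stored feedback informs the next call to plan_fn.
    return state, max_rounds


# Toy stand-ins (assumptions for illustration only).
def toy_plan(state: CycleState) -> List[str]:
    steps = ["describe video"]
    if state.feedback:  # react to the evaluator's critique
        steps.append("retrieve external knowledge")
    return steps


def toy_do(state: CycleState) -> Tuple[List[str], str]:
    evidence = [f"result of: {step}" for step in state.plan]
    return evidence, "answer based on " + ", ".join(state.plan)


def toy_study(state: CycleState) -> Tuple[bool, str]:
    if "retrieve external knowledge" in state.plan:
        return True, ""
    return False, "needs external knowledge"


final_state, rounds = pdsa_loop(
    "Why does the liquid change color?", toy_plan, toy_do, toy_study
)
print(rounds, final_state.plan)
```

In this toy run the first draft is rejected for lacking external knowledge, the planner adds a retrieval step in response, and the second round is accepted. The `max_rounds` cap is a design choice of the sketch so that the loop always terminates even if the evaluator never accepts.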

Paper Structure

This paper contains 43 sections, 8 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: SciEducator, conducting video comprehension and delivering science education, can generate multimodal educational e‑booklets that provide comprehensive, detailed, and engaging guidance.
  • Figure 2: SciEducator architecture overview. We design a multi-agent system that implements the Plan-Do-Study-Act (PDSA) cycle, refining its output responses over successive iterations.
  • Figure 3: Proportional distribution of video and QA categories. SciVBench comprises three types of scientific videos and five categories of questions, enabling a comprehensive evaluation of a model’s ability to acquire diverse domain knowledge and tackle various complex scientific problems.
  • Figure 4: Qualitative comparison between SciEducator and MLLMs in educational e-booklet generation. Left: our generated e-booklet, with comprehensive content and a well-organized structure. Right: outputs of other popular MLLMs, all of which fail at this generation task. These examples demonstrate SciEducator's ability to generate more comprehensive, better-structured, and more attractive educational e-booklets.
  • Figure 5: Qualitative comparison between SciEducator and MLLMs. These examples demonstrate SciEducator's ability to generate more comprehensive, better-structured, and more logically coherent answers than the other MLLMs.
  • ...and 10 more figures