EduVQA: Benchmarking AI-Generated Video Quality Assessment for Education

Baoliang Chen, Xinlong Bu, Lingyu Zhu, Hanwei Zhu, Xiangjie Sui

TL;DR

A Structured 2D Mixture-of-Experts (S2D-MoE) module is proposed, which enhances the dependency between overall quality and each sub-dimension through shared experts and a dynamic 2D gating matrix.

Abstract

While AI-generated content (AIGC) models have achieved remarkable success in generating photorealistic videos, their potential to support visual, story-driven learning in education remains largely untapped. To close this gap, we present EduAIGV-1k, the first benchmark dataset and evaluation framework dedicated to assessing the quality of AI-generated videos (AIGVs) designed to teach foundational math concepts, such as numbers and geometry, to young learners. EduAIGV-1k contains 1,130 short videos produced by ten state-of-the-art text-to-video (T2V) models using 113 pedagogy-oriented prompts. Each video is accompanied by rich, fine-grained annotations along two complementary axes: (1) perceptual quality, disentangled into spatial and temporal fidelity, and (2) prompt alignment, labeled at the word and sentence levels to quantify the degree to which each mathematical concept in the prompt is accurately grounded in the generated video. These fine-grained annotations turn each video into a multi-dimensional, interpretable supervision signal that goes far beyond a single quality score. Leveraging this dense feedback, we introduce EduVQA for both perceptual and alignment quality assessment of AIGVs. In particular, we propose a Structured 2D Mixture-of-Experts (S2D-MoE) module, which enhances the dependency between overall quality and each sub-dimension through shared experts and a dynamic 2D gating matrix. Extensive experiments show that EduVQA consistently outperforms existing VQA baselines. Both our dataset and code will be made publicly available.
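
The abstract does not spell out how S2D-MoE is built, so the following is only a rough illustration of the general idea of shared experts combined through an input-dependent 2D gate (quality dimensions by experts). Everything in it is an assumption made for illustration: the class name `TwoDGatedMoE`, the linear experts, the per-row softmax, the feature size, and the number of experts. The count of five output dimensions follows the "five quality dimensions" mentioned in the Figure 4 caption. It should not be read as the authors' implementation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class TwoDGatedMoE:
    """Hypothetical sketch: shared experts mixed by a (dimension x expert) gate."""

    def __init__(self, feat_dim, num_experts, num_dims, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        # Shared experts (assumed linear here): each maps a feature to one scalar.
        self.experts = rand(num_experts, feat_dim)
        # One gating projection per quality dimension: feature -> expert logits.
        self.gates = [rand(num_experts, feat_dim) for _ in range(num_dims)]

    def forward(self, feature):
        # Every quality dimension reads from the same pool of expert outputs.
        expert_out = matvec(self.experts, feature)
        # "Dynamic" 2D gate: recomputed per input, one softmax row per dimension.
        gate = [softmax(matvec(g, feature)) for g in self.gates]
        # Each dimension's score is its own weighted mix of the shared experts.
        scores = [sum(w * e for w, e in zip(row, expert_out)) for row in gate]
        return scores, gate


# Toy usage with a random stand-in for a video feature vector.
rng = random.Random(1)
feature = [rng.gauss(0.0, 1.0) for _ in range(8)]
moe = TwoDGatedMoE(feat_dim=8, num_experts=4, num_dims=5)
scores, gate = moe.forward(feature)
print(len(scores), len(gate), len(gate[0]))
```

Because all rows of the gate draw on the same expert outputs, the overall score and the sub-dimension scores are coupled through shared computation, which is one plausible way to read "enhances the dependency between overall quality and each sub-dimension".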

Paper Structure (31 sections, 15 equations, 13 figures, 3 tables)

Figures (13)

  • Figure 1: Annotation structure of our constructed EduAIGV-1k dataset. Each educational video is annotated with spatial and temporal fidelity and word-level semantic consistency, enabling a fine-grained assessment of perceptual quality and prompt alignment. The red and blue elliptical regions indicate temporal inconsistencies that negatively impact temporal quality.
  • Figure 2: An overview of our dataset, divided into four categories: Numbers, Geometry, Measurement, and Probability.
  • Figure 3: Annotation Analysis. (a)-(e): MOS distributions across five dimensions; (f): Average MOS of each scene category.
  • Figure 4: Overview of EduVQA framework. We jointly predict five quality dimensions via a dual-path framework equipped with 2D MoE.
  • Figure 5: Qualitative comparison of perceptual quality (top row) and prompt alignment (bottom row). We compare our EduVQA model against state-of-the-art baselines, IP-IQA and T2VQA, in each quality dimension. In each video pair, the right video exhibits superior perceptual quality or prompt alignment compared to the left. EduVQA consistently aligns with human judgments, while IP-IQA and T2VQA produce rankings contrary to the MOS.
  • ...and 8 more figures