VCEval: Rethinking What is a Good Educational Video and How to Automatically Evaluate It
Xiaoxuan Zhu, Zhouhong Gu, Sihang Jiang, Zhixu Li, Hongwei Feng, Yanghua Xiao
TL;DR
This work addresses the challenge of automatic evaluation of educational video quality by reframing the task as a multi-target, multiple-choice QA problem and implementing VCEval, a framework that leverages text extraction from multimodal video content and an LLM-based evaluator. It introduces a K12 video-course benchmark, with data collection, annotation, and a three-phase training protocol (prior unlearning, in-class teaching, and in-class testing) to produce interpretable, target-aware quality scores. The framework demonstrates superior alignment with human judgments at both video and target levels, outperforming traditional text-similarity baselines and even strong ChatGPT baselines under practical input limitations. The proposed approach offers a scalable, interpretable, and fair method for guiding learners, creators, and platforms toward higher-quality video teaching materials, with potential impact on content curation and course design. Key contributions include the three-principle evaluation framework, the VCEval methodology, and the K12 benchmark with demonstrated consistency with human annotations.
Abstract
Online courses have significantly lowered the barrier to accessing education, yet the varying content quality of these videos poses challenges. In this work, we focus on the task of automatically evaluating the quality of video course content. We have constructed a dataset with a substantial collection of video courses and teaching materials. We propose three evaluation principles and design a new evaluation framework, VCEval, based on these principles. The task is modeled as a multiple-choice question-answering task, with a language model serving as the evaluator. Our method effectively distinguishes video courses of different content quality and produces a range of interpretable results.
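To make the multiple-choice formulation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each question is tied to one teaching target and that a video's score on a target is the fraction of that target's questions the evaluator answers correctly after reading text extracted from the video. The names `Question`, `score_video`, and `keyword_evaluator` are hypothetical, and the keyword matcher is only a toy stand-in for the language-model evaluator.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Question:
    """One multiple-choice question (hypothetical structure)."""
    target: str         # teaching target the question probes
    prompt: str
    options: List[str]
    answer: int         # index of the correct option


# An evaluator maps (video text, question) to the index of the chosen option.
Evaluator = Callable[[str, Question], int]


def score_video(transcript: str,
                questions: List[Question],
                evaluator: Evaluator) -> Dict[str, float]:
    """Return per-target accuracy of the evaluator on the given questions."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for q in questions:
        total[q.target] += 1
        if evaluator(transcript, q) == q.answer:
            correct[q.target] += 1
    return {t: correct[t] / total[t] for t in total}


def keyword_evaluator(transcript: str, q: Question) -> int:
    """Toy stand-in for a language model (assumption): choose the first
    option whose text appears in the transcript, else fall back to option 0."""
    text = transcript.lower()
    for i, option in enumerate(q.options):
        if option.lower() in text:
            return i
    return 0


# Illustrative data, invented for this sketch.
demo_transcript = "Photosynthesis produces oxygen. Plants need sunlight."
demo_questions = [
    Question("photosynthesis", "What gas does photosynthesis produce?",
             ["carbon dioxide", "oxygen"], 1),
    Question("respiration", "What does respiration release besides energy?",
             ["glucose", "water"], 1),
]
scores = score_video(demo_transcript, demo_questions, keyword_evaluator)
print(scores)  # {'photosynthesis': 1.0, 'respiration': 0.0}
```

In this toy run the video covers the first target but not the second, so the per-target scores differ. That per-target breakdown is the kind of interpretable output the abstract refers to; the scoring actually used by VCEval may differ.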
