Code2Video: A Code-centric Paradigm for Educational Video Generation

Yanzhe Chen, Kevin Qinghong Lin, Mike Zheng Shou

TL;DR

Code2Video introduces a code-centric paradigm for educational video generation, leveraging a Planner–Coder–Critic tri‑agent architecture to produce executable Manim code and controllable, auditable visualizations. The MMMC benchmark assesses multi-disciplinary educational videos along three axes: aesthetics via VLM judgments, knowledge transfer via TeachQuiz, and efficiency in code generation. Experimental results show substantial improvements over pixel-based and direct-code baselines, with agentic components notably enhancing temporal coherence, spatial clarity, and learning outcomes, occasionally approaching human-made tutorials. By using code as a unifying medium for sequential content and spatial layout, the framework offers scalable, interpretable, and extensible pathways for high-quality educational video generation and evaluation.

Abstract

While recent generative models have advanced pixel-space video synthesis, they remain limited in producing professional educational videos, which demand disciplinary knowledge, precise visual structures, and coherent transitions. Intuitively, such requirements are better addressed by manipulating a renderable environment that can be explicitly controlled via logical commands (e.g., code). In this work, we propose Code2Video, a code-centric agent framework for generating educational videos via executable Python code. The framework comprises three collaborative agents: (i) Planner, which structures lecture content into temporally coherent flows and prepares corresponding visual assets; (ii) Coder, which converts structured instructions into executable Python code while incorporating scope-guided auto-fix to enhance efficiency; and (iii) Critic, which leverages vision-language models (VLMs) with visual anchor prompts to refine spatial layout and ensure clarity. To support systematic evaluation, we build MMMC, a benchmark of professionally produced, discipline-specific educational videos. We evaluate on MMMC across diverse dimensions, including VLM-as-a-Judge aesthetic scores, code efficiency, and, in particular, TeachQuiz, a novel end-to-end metric that quantifies how well a VLM, after unlearning, can recover knowledge by watching the generated videos. Our results demonstrate the potential of Code2Video as a scalable, interpretable, and controllable approach, achieving a 40% improvement over direct code generation and producing videos comparable to human-crafted tutorials. The code and datasets are available at https://github.com/showlab/Code2Video.
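The TeachQuiz idea in the abstract can be read as a before/after comparison: the same quiz is answered once after unlearning and once after watching the generated video, and the metric is the gain between the two. The sketch below illustrates that reading only. The function names, the use of plain accuracy, and the toy answers are assumptions made for illustration, not the paper's exact formulation.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of quiz questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("quiz must contain at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def teachquiz_gap(
    unlearned_preds: Sequence[str],
    video_preds: Sequence[str],
    answers: Sequence[str],
) -> float:
    """Score gain from watching the video (hypothetical helper).

    unlearned_preds: quiz answers given after the unlearning stage.
    video_preds:     quiz answers given after watching the generated video.
    """
    return accuracy(video_preds, answers) - accuracy(unlearned_preds, answers)


# Toy data, invented for illustration: a four-question multiple-choice quiz.
answers = ["B", "A", "D", "C"]
unlearned = ["A", "A", "C", "B"]    # 1 of 4 correct after unlearning
after_video = ["B", "A", "D", "A"]  # 3 of 4 correct after watching

gap = teachquiz_gap(unlearned, after_video, answers)
print(gap)  # 0.5
```

A larger gap suggests the video conveyed more of the target knowledge; a gap near zero suggests the model learned little from watching it.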

Paper Structure

This paper contains 45 sections, 5 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Overview of Code2Video. A code-centric paradigm for educational video generation, where Planner ensures temporal flow, Coder bridges instructions to executable animations, and Critic refines spatial layout. Evaluation is performed on MMMC with multi-dimensional metrics.
  • Figure 2: MMMC overview. (1) Left: distribution of 13 subject categories with exemplar learning topics; ring width encodes video duration. (2) Middle: learning topic word cloud highlighting core concepts. (3) Right: average learning topic length per category.
  • Figure 3: TeachQuiz: score gap between Learning-from-Video and Unlearning stages.
  • Figure 4: Illustration of Code2Video. Given a user inquiry, Code2Video aims to render an educational video via Manim code writing: (i) the Planner converts a learning topic into a storyboard and retrieves visual assets; (ii) the Coder performs parallel code synthesis with scope-guided refinement to ensure efficiency and temporal consistency; (iii) the Critic uses visual anchor prompts to iteratively adjust spatial layout and clarity, yielding reproducible, educationally structured videos.
  • Figure 5: Illustration of visual anchor prompt ($\mathcal{P}_{\rm vis}$).
  • ...and 4 more figures
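The control flow described in the Figure 4 caption (Planner produces a storyboard, Coder synthesizes code for the sections in parallel, Critic refines iteratively) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `Section`, `generate_video_code`, and the three toy agents are hypothetical names, the real agents would wrap LLM or VLM calls, and the toy critic edits code strings directly, whereas the paper's Critic inspects rendered output with visual anchor prompts.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Section:
    """One storyboard entry (hypothetical structure)."""
    title: str
    instructions: str


def generate_video_code(
    topic: str,
    planner: Callable[[str], List[Section]],
    coder: Callable[[Section], str],
    critic: Callable[[str], str],
    critic_rounds: int = 1,
) -> List[str]:
    """Run a planner -> coder -> critic pass and return one code string per section."""
    storyboard = planner(topic)

    # Sections are coded concurrently; Executor.map returns results in input order.
    with ThreadPoolExecutor() as pool:
        drafts = list(pool.map(coder, storyboard))

    refined = []
    for code in drafts:
        for _ in range(critic_rounds):
            code = critic(code)  # each round may rewrite the section's code
        refined.append(code)
    return refined


# Toy stand-ins for the three agents, invented for illustration only.
def toy_planner(topic: str) -> List[Section]:
    return [
        Section("Intro", f"Introduce {topic}"),
        Section("Example", f"Work an example of {topic}"),
    ]


def toy_coder(section: Section) -> str:
    return f"# scene: {section.title}\n# todo: {section.instructions}"


def toy_critic(code: str) -> str:
    return code + "\n# layout checked"


result = generate_video_code("Fourier series", toy_planner, toy_coder, toy_critic)
```

Passing the agents in as callables keeps the orchestration separate from any particular model backend, so each stage can be swapped or mocked independently.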