TheoremExplainAgent: Towards Video-based Multimodal Explanations for LLM Theorem Understanding

Max Ku, Thomas Chong, Jonathan Leung, Krish Shah, Alvin Yu, Wenhu Chen

TL;DR

TheoremExplainAgent introduces an agentic framework to generate long-form multimodal explanations of theorems as Manim videos, addressing the gap in multimodal theorem reasoning beyond text. The TEA pipeline uses a planner-coding duo and retrieval-augmented generation to produce structured, scene-driven videos with narration, evaluated on a 240-theorem benchmark (TEB) across four STEM fields with five metrics. Key findings show agentic planning enables coherent, lengthy explanations, while visual layout and retrieval components reveal important limitations and error modes not evident in text-only evaluations. The work underscores the value of multimodal explanations for diagnosing reasoning flaws and advancing pedagogy, while providing benchmarks, metrics, and artifacts for future development.

Abstract

Understanding domain-specific theorems often requires more than just text-based reasoning; effective communication through structured visual explanations is crucial for deeper comprehension. While large language models (LLMs) demonstrate strong performance in text-based theorem reasoning, their ability to generate coherent and pedagogically meaningful visual explanations remains an open challenge. In this work, we introduce TheoremExplainAgent, an agentic approach for generating long-form theorem explanation videos (over 5 minutes) using Manim animations. To systematically evaluate multimodal theorem explanations, we propose TheoremExplainBench, a benchmark covering 240 theorems across multiple STEM disciplines, along with 5 automated evaluation metrics. Our results reveal that agentic planning is essential for generating detailed long-form videos, and the o3-mini agent achieves a success rate of 93.8% and an overall score of 0.77. However, our quantitative and qualitative studies show that most of the videos produced exhibit minor issues with visual element layout. Furthermore, multimodal explanations expose deeper reasoning flaws that text-based explanations fail to reveal, highlighting the importance of multimodal explanations.
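The two-agent flow summarized above (a planner that breaks a theorem into scenes, and a coding agent that writes one Manim script per scene) might be sketched as follows. This is only an illustrative outline, not the authors' implementation: `ScenePlan`, `plan_scenes`, `write_scripts`, `explain_theorem`, and the two stub agents are hypothetical names, and the stubs stand in for real LLM calls. Retrieval, narration, and rendering are left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScenePlan:
    """One planned scene: its position in the video and what it should show."""
    index: int
    description: str


def plan_scenes(theorem: str, planner: Callable[[str], List[str]]) -> List[ScenePlan]:
    """Ask the (hypothetical) planner agent for an ordered list of scene descriptions."""
    return [ScenePlan(i, text) for i, text in enumerate(planner(theorem), start=1)]


def write_scripts(plans: List[ScenePlan], coder: Callable[[ScenePlan], str]) -> List[str]:
    """Ask the (hypothetical) coding agent for one script per planned scene."""
    return [coder(plan) for plan in plans]


def explain_theorem(
    theorem: str,
    planner: Callable[[str], List[str]],
    coder: Callable[[ScenePlan], str],
) -> List[str]:
    """Run planning first, then code generation, and return the scripts in scene order."""
    return write_scripts(plan_scenes(theorem, planner), coder)


# Stub agents so the sketch runs without any model access (assumptions, not real agents).
def stub_planner(theorem: str) -> List[str]:
    return [f"State {theorem}", f"Visualize {theorem}", f"Summarize {theorem}"]


def stub_coder(plan: ScenePlan) -> str:
    return f"# scene {plan.index}: {plan.description}\n"


scripts = explain_theorem("the Pythagorean theorem", stub_planner, stub_coder)
```

Keeping the planner and coder as plain callables means either one can be swapped for a different model without touching the pipeline itself.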

Paper Structure

This paper contains 37 sections, 10 figures, 7 tables.

Figures (10)

  • Figure 1: "We do not have knowledge of a thing until we have grasped its cause" (Aristotle, 1901). A strong reasoning model should not only generate correct conclusions but also communicate them effectively. Visualization enhances human intuition by making abstract concepts more concrete and revealing hidden relationships. Moreover, visual explanations expose reasoning errors more clearly than text, making it easier to diagnose model mistakes.
  • Figure 2: An overview of the multimodal theorem explanation framework.
  • Figure 3: TheoremExplainAgent consists of two LLM agents. Taking a theorem as input, the planner agent creates plans for execution. The coding agent then generates Python scripts to produce visuals and audio.
  • Figure 4: Subfields of TheoremExplainBench under Computer Science, Chemistry, Mathematics, and Physics.
  • Figure 5: Visualizations expose reasoning errors more clearly than text, making it easier to diagnose model mistakes.
  • ...and 5 more figures