MAGMA-Edu: Multi-Agent Generative Multimodal Framework for Text-Diagram Educational Question Generation
Zhenyu Wu, Jian Li, Hua Huang
TL;DR
MAGMA-Edu is a self-reflective, multi-agent framework for text–diagram educational question generation. It unifies reasoning and diagram synthesis in a two-stage pipeline built on a Generate–Validate–Reflect loop and a code-based intermediate representation. Stage 1 generates the question text and refines its diagram description for linguistic and pedagogical quality; Stage 2 translates that description into executable drawing code, renders geometrically faithful diagrams, and checks cross-modal consistency. On a multimodal K‑12 math benchmark, MAGMA-Edu substantially outperforms state‑of‑the‑art MLLMs in both textual quality and image-text consistency (ITC) (e.g., GPT‑4o: Avg‑Text from ~57 to ~92 and ITC from ~13 to ~85), reaching up to 99.12 ITC with Gemini 2.5 Pro. The framework emphasizes interpretability, verifiability, and scalable cross‑modal educational content generation, with potential to extend to broader STEM domains and curriculum design via deeper symbolic–neural integration.
Abstract
Educational illustrations play a central role in communicating abstract concepts, yet current multimodal large language models (MLLMs) remain limited in producing pedagogically coherent and semantically consistent educational visuals. We introduce MAGMA-Edu, a self-reflective multi-agent framework that unifies textual reasoning and diagrammatic synthesis for structured educational problem generation. Unlike existing methods that treat text and image generation independently, MAGMA-Edu employs a two-stage co-evolutionary pipeline: (1) a generation-verification-reflection loop that iteratively refines question statements and solutions for mathematical accuracy, and (2) a code-based intermediate representation that enforces geometric fidelity and semantic alignment during image rendering. Both stages are guided by internal self-reflection modules that evaluate and revise outputs until domain-specific pedagogical constraints are met. Extensive experiments on multimodal educational benchmarks demonstrate the superiority of MAGMA-Edu over state-of-the-art MLLMs. Compared to GPT-4o, MAGMA-Edu improves the average textual metric from 57.01 to 92.31 (+35.3 pp) and boosts image-text consistency (ITC) from 13.20 to 85.24 (+72 pp). Across all model backbones, MAGMA-Edu achieves the highest scores (Avg-Text 96.20, ITC 99.12), establishing a new state of the art for multimodal educational content generation and demonstrating the effectiveness of self-reflective multi-agent collaboration in pedagogically aligned vision-language reasoning.
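The generation-verification-reflection loop described above can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the function `generate_validate_reflect`, the `Verdict` type, the `max_rounds` budget, and the toy triangle-area agents are all hypothetical stand-ins for the LLM-backed agents the abstract refers to.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    """Result of a validation pass (hypothetical structure)."""
    ok: bool
    feedback: str = ""


def generate_validate_reflect(
    generate: Callable[[Optional[str]], dict],
    validate: Callable[[dict], Verdict],
    reflect: Callable[[dict, Verdict], str],
    max_rounds: int = 3,  # assumed iteration budget, not taken from the paper
) -> Tuple[dict, int, bool]:
    """Regenerate a draft until it validates or the budget runs out.

    Returns (last draft, rounds used, whether the draft was accepted).
    """
    hint: Optional[str] = None
    draft: dict = {}
    for attempt in range(1, max_rounds + 1):
        draft = generate(hint)          # produce or revise a draft
        verdict = validate(draft)       # check it against constraints
        if verdict.ok:
            return draft, attempt, True
        hint = reflect(draft, verdict)  # turn the failure into guidance
    return draft, max_rounds, False


# Toy agents: a triangle-area question whose first draft has a wrong answer.
def toy_generate(hint: Optional[str]) -> dict:
    answer = 24 if hint is None else 12  # first draft forgets the factor 1/2
    return {"base": 6, "height": 4, "answer": answer}


def toy_validate(draft: dict) -> Verdict:
    expected = 0.5 * draft["base"] * draft["height"]
    if draft["answer"] == expected:
        return Verdict(ok=True)
    return Verdict(ok=False, feedback=f"area should be {expected}")


def toy_reflect(draft: dict, verdict: Verdict) -> str:
    return f"Revise the answer: {verdict.feedback}"


if __name__ == "__main__":
    print(generate_validate_reflect(toy_generate, toy_validate, toy_reflect))
```

In this sketch the same loop shape could wrap either stage: in Stage 1 the draft would be the question text and solution, and in Stage 2 it would be the drawing code, with the validator checking the rendered diagram against the description. That reuse is an assumption about how the two stages relate, not a detail stated in the abstract.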
