MAGMA-Edu: Multi-Agent Generative Multimodal Framework for Text-Diagram Educational Question Generation

Zhenyu Wu, Jian Li, Hua Huang

TL;DR

MAGMA-Edu is a self-reflective, multi-agent framework for text–diagram educational question generation. It unifies reasoning and diagram synthesis through a two-stage Generate–Validate–Reflect loop and a code-based intermediate representation. Stage 1 generates the question text and refines the diagram description to ensure linguistic and pedagogical quality. Stage 2 translates the description into executable drawing code, renders a geometrically faithful diagram, and applies cross-modal consistency checks. On a multimodal K‑12 math benchmark, MAGMA-Edu substantially outperforms state‑of‑the‑art MLLMs in both textual quality and image-text consistency (ITC), with large gains (e.g., GPT‑4o: Avg‑Text from ~57 to ~92 and ITC from ~13 to ~85) and an ITC of up to 99.12 with Gemini 2.5 Pro. The framework emphasizes interpretability, verifiability, and scalable cross‑modal educational content generation, with potential to extend to broader STEM domains and curriculum design via deeper symbolic–neural integration.

Abstract

Educational illustrations play a central role in communicating abstract concepts, yet current multimodal large language models (MLLMs) remain limited in producing pedagogically coherent and semantically consistent educational visuals. We introduce MAGMA-Edu, a self-reflective multi-agent framework that unifies textual reasoning and diagrammatic synthesis for structured educational problem generation. Unlike existing methods that treat text and image generation independently, MAGMA-Edu employs a two-stage co-evolutionary pipeline: (1) a generation-verification-reflection loop that iteratively refines question statements and solutions for mathematical accuracy, and (2) a code-based intermediate representation that enforces geometric fidelity and semantic alignment during image rendering. Both stages are guided by internal self-reflection modules that evaluate and revise outputs until domain-specific pedagogical constraints are met. Extensive experiments on multimodal educational benchmarks demonstrate the superiority of MAGMA-Edu over state-of-the-art MLLMs. Compared to GPT-4o, MAGMA-Edu improves the average textual metric from 57.01 to 92.31 (+35.3 pp) and boosts image-text consistency (ITC) from 13.20 to 85.24 (+72 pp). Across all model backbones, MAGMA-Edu achieves the highest scores (Avg-Text 96.20, ITC 99.12), establishing a new state of the art for multimodal educational content generation and demonstrating the effectiveness of self-reflective multi-agent collaboration in pedagogically aligned vision-language reasoning.
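
The generation-verification-reflection loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `refine`, `Verdict`, and `max_rounds`, and the three toy agent functions, are hypothetical, and in the real system each agent would presumably be an LLM call rather than a string check.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    """Outcome of one validation pass (hypothetical structure)."""
    passed: bool
    feedback: str


def refine(
    prompt: str,
    generate: Callable[[str], str],
    validate: Callable[[str], Verdict],
    reflect: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, List[Verdict]]:
    """Generate a draft, then validate and revise it for at most max_rounds rounds.

    If every round fails, the last revision is returned without a final check.
    """
    draft = generate(prompt)
    history: List[Verdict] = []
    for _ in range(max_rounds):
        verdict = validate(draft)
        history.append(verdict)
        if verdict.passed:
            break
        # The reflector revises the draft using the validator's feedback.
        draft = reflect(draft, verdict.feedback)
    return draft, history


# Toy stand-ins for the three agents (assumptions for illustration only).
def toy_generate(prompt: str) -> str:
    return f"Question: {prompt}"


def toy_validate(draft: str) -> Verdict:
    if "Answer:" in draft:
        return Verdict(True, "")
    return Verdict(False, "missing worked answer")


def toy_reflect(draft: str, feedback: str) -> str:
    return draft + "\nAnswer: 12"


question, history = refine(
    "Find the area of a triangle with base 6 and height 4.",
    toy_generate,
    toy_validate,
    toy_reflect,
)
```

Capping the loop with `max_rounds` is a design choice of this sketch; it keeps the loop bounded when the validator never accepts a draft.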

Paper Structure

This paper contains 26 sections, 8 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Geometric image generation by multimodal large models versus MAGMA-Edu. After multiple rounds of human feedback, multimodal large models still generate incorrect images, whereas MAGMA-Edu generates correct images through a two-stage iteration.
  • Figure 2: Detailed workflow of the proposed MAGMA‑Edu framework. Stage 1 (Text Generation) employs three collaborative agents—Text Generator, Text Validator, and Text Reflector—to iteratively produce, evaluate, and refine problem statements from a given prompt. Stage 2 (Image Generation) mirrors this process with Code Generator, Code Executor, Image Validator, and Image Reflector agents, which translate verified text into executable drawing code and refine it into accurate, interpretable diagrams. Both stages form a closed‑loop multimodal optimization system that outputs pedagogically aligned text–image pairs as final questions.
  • Figure 3: Effect of reflection frequency on model performance in Stage 1 and Stage 2. The average score of Stage 1 is computed as the mean of six metrics (UO, LR, QF, AA, CA, and IDQ), while Stage 2 uses the ITC metric.
  • Figure 4: Comparison between images generated by nano-banana and images rendered from code.
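
To make the Stage 2 idea from Figure 2 concrete, the sketch below runs a piece of drawing code and checks the resulting geometry against lengths stated in the problem text. It is an illustration under assumptions, not the paper's pipeline: the drawing primitives `point` and `segment`, the helpers `execute_drawing_code` and `check_lengths`, and the length-only check are all hypothetical, and the source does not say here which drawing language or validator MAGMA-Edu uses.

```python
import math
from typing import Dict, List, Tuple


def execute_drawing_code(source: str) -> dict:
    """Run drawing code and collect the scene it describes (hypothetical executor).

    Emptying __builtins__ only limits accidental misuse; it is NOT a security sandbox.
    """
    points: Dict[str, Tuple[float, float]] = {}
    segments: List[Tuple[str, str]] = []

    def point(label, x, y):
        points[label] = (float(x), float(y))

    def segment(a, b):
        segments.append((a, b))

    env = {"__builtins__": {}, "point": point, "segment": segment}
    exec(source, env)
    return {"points": points, "segments": segments}


def check_lengths(scene: dict, expected: Dict[Tuple[str, str], float],
                  tol: float = 1e-6) -> List[str]:
    """Compare drawn segment lengths with the lengths stated in the text."""
    errors = []
    for (a, b), want in expected.items():
        if a not in scene["points"] or b not in scene["points"]:
            errors.append(f"missing point for segment {a}{b}")
            continue
        got = math.dist(scene["points"][a], scene["points"][b])
        if abs(got - want) > tol:
            errors.append(f"{a}{b}: drew {got:.2f}, text says {want:.2f}")
    return errors


# Lengths a problem statement might give for a 3-4-5 right triangle (made-up example).
stated = {("A", "B"): 3.0, ("B", "C"): 5.0, ("C", "A"): 4.0}

good_code = 'point("A", 0, 0)\npoint("B", 3, 0)\npoint("C", 0, 4)\nsegment("A", "B")'
bad_code = 'point("A", 0, 0)\npoint("B", 3, 0)\npoint("C", 0, 5)\nsegment("A", "B")'

good_errors = check_lengths(execute_drawing_code(good_code), stated)
bad_errors = check_lengths(execute_drawing_code(bad_code), stated)
```

Because the diagram exists as code before it exists as pixels, a mismatch such as the misplaced point `C` in `bad_code` can be reported as text and handed back to a reflector, which is one reason a code-based intermediate representation is easier to verify than a directly generated image.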

Theorems & Definitions (2)

  • Remark 1
  • Remark 2