DiagramIR: An Automatic Pipeline for Educational Math Diagram Evaluation
Vishal Kumar, Shubhra Mishra, Rebecca Hao, Rizwaan Malik, David Broman, Dorottya Demszky
TL;DR
The paper tackles the challenge of evaluating mathematical diagrams produced by educational LLMs. It introduces DiagramIR, which converts TikZ drawings into a structured intermediate representation (IR) and applies deterministic, rule-based checks via back-translation, enabling scalable and cost-effective diagram evaluation. On a 398-item real-world dataset, DiagramIR agrees with human judgments more closely than LLM-based judges do (with $\kappa$ around $0.48$–$0.56$ in the strongest settings) and allows smaller models to match larger frontier models at roughly $10\times$ lower cost. The approach provides auditable, domain-specific diagram evaluation that scales for education technology and reduces dependence on expensive models.
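For readers unfamiliar with the agreement statistic, the sketch below shows how a chance-corrected agreement score can be computed between human labels and an automatic evaluator's labels. It assumes that $\kappa$ here denotes Cohen's kappa, which this summary does not state explicitly; the function name `cohen_kappa` and the toy label lists are illustrative and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    Assumption: this is the standard two-rater formula,
    kappa = (p_o - p_e) / (1 - p_e), not necessarily the exact
    variant used by the paper.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")

    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement by chance, from each rater's label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (made up for illustration): 1 = diagram passes, 0 = diagram fails.
human_labels = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
auto_labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
kappa = cohen_kappa(human_labels, auto_labels)
```

In this toy example the two raters agree on 8 of 10 items and chance agreement is 0.5, so `kappa` comes out at approximately 0.6.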
Abstract
Large Language Models (LLMs) are increasingly being adopted as tools for learning; however, most tools remain text-only, limiting their usefulness for domains where visualizations are essential, such as mathematics. Recent work shows that LLMs are capable of generating code that compiles to educational figures, but a major bottleneck remains: scalable evaluation of these diagrams. We address this by proposing DiagramIR: an automatic and scalable evaluation pipeline for geometric figures. Our method relies on intermediate representations (IRs) of LaTeX TikZ code. We compare our pipeline to other evaluation baselines such as LLM-as-a-Judge, showing that our approach has higher agreement with human raters. This evaluation approach also enables smaller models like GPT-4.1-Mini to perform comparably to larger models such as GPT-5 at a $10\times$ lower inference cost, which is important for deploying accessible and scalable education technologies.
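To make the idea of deterministic checks over an intermediate representation concrete, here is a minimal sketch. Everything in it is hypothetical: the `ToyDiagramIR`, `ToyPolygon`, and `ToyLabel` classes and the two checks are invented for illustration and are not the IR schema or rule set used by DiagramIR, which this abstract does not specify. The sketch also assumes the IR has already been extracted from the TikZ code.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ToyPolygon:
    vertices: list                                       # [(x, y), ...] in drawing order
    right_angle_at: list = field(default_factory=list)   # vertex indices marked as right angles


@dataclass
class ToyLabel:
    text: str
    position: tuple                                      # (x, y)


@dataclass
class ToyDiagramIR:
    canvas: tuple                                        # (xmin, ymin, xmax, ymax)
    polygons: list = field(default_factory=list)
    labels: list = field(default_factory=list)


def labels_inside_canvas(ir):
    """Every label anchor must lie within the canvas bounds."""
    xmin, ymin, xmax, ymax = ir.canvas
    return all(
        xmin <= label.position[0] <= xmax and ymin <= label.position[1] <= ymax
        for label in ir.labels
    )


def marked_right_angles_are_right(ir, tol=1e-6):
    """Every vertex marked as a right angle must really measure 90 degrees."""
    for poly in ir.polygons:
        n = len(poly.vertices)
        for i in poly.right_angle_at:
            vx, vy = poly.vertices[i]
            px, py = poly.vertices[i - 1]          # previous vertex (wraps around)
            nx, ny = poly.vertices[(i + 1) % n]    # next vertex (wraps around)
            ax, ay = px - vx, py - vy
            bx, by = nx - vx, ny - vy
            dot = ax * bx + ay * by
            scale = math.hypot(ax, ay) * math.hypot(bx, by)
            # Perpendicular edges have a zero dot product (up to tolerance).
            if scale == 0 or abs(dot) > tol * scale:
                return False
    return True


# Registry of rule-based checks; each returns True (pass) or False (fail).
CHECKS = {
    "labels_inside_canvas": labels_inside_canvas,
    "marked_right_angles_are_right": marked_right_angles_are_right,
}


def run_checks(ir):
    """Run every check and return a per-check pass/fail report."""
    return {name: check(ir) for name, check in CHECKS.items()}


# A 3-4-5 right triangle with the right angle correctly marked at vertex 0.
good_ir = ToyDiagramIR(
    canvas=(-1, -1, 5, 4),
    polygons=[ToyPolygon(vertices=[(0, 0), (4, 0), (0, 3)], right_angle_at=[0])],
    labels=[ToyLabel(text="5", position=(2.5, 2.0))],
)

# The same triangle with the right-angle marker misplaced at vertex 1.
bad_ir = ToyDiagramIR(
    canvas=(-1, -1, 5, 4),
    polygons=[ToyPolygon(vertices=[(0, 0), (4, 0), (0, 3)], right_angle_at=[1])],
    labels=[ToyLabel(text="5", position=(2.5, 2.0))],
)

good_report = run_checks(good_ir)
bad_report = run_checks(bad_ir)
```

Because each check is a pure function of the IR, the same diagram always yields the same report, and a failing check names the specific rule that was violated. That is the sense in which rule-based evaluation of this kind can be audited, in contrast to a free-form LLM judgment.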
