DiagramIR: An Automatic Pipeline for Educational Math Diagram Evaluation

Vishal Kumar, Shubhra Mishra, Rebecca Hao, Rizwaan Malik, David Broman, Dorottya Demszky

TL;DR

The paper tackles the challenge of evaluating mathematical diagrams produced by educational LLMs. It introduces DiagramIR, which back-translates TikZ code into a structured intermediate representation and then applies deterministic, rule-based checks, enabling scalable and cost-effective diagram evaluation. On a 398-item real-world dataset, DiagramIR shows higher agreement with human judgments than LLM-based judges (e.g., $\kappa$ around $0.48$–$0.56$ in the strongest settings) and allows smaller models to match larger frontier models at roughly $10\times$ lower cost. This approach provides auditable, domain-specific diagram evaluation that scales for education technology and reduces dependency on expensive models.

Abstract

Large Language Models (LLMs) are increasingly being adopted as tools for learning; however, most tools remain text-only, limiting their usefulness for domains where visualizations are essential, such as mathematics. Recent work shows that LLMs are capable of generating code that compiles to educational figures, but a major bottleneck remains: scalable evaluation of these diagrams. We address this by proposing DiagramIR: an automatic and scalable evaluation pipeline for geometric figures. Our method relies on intermediate representations (IRs) of LaTeX TikZ code. We compare our pipeline to other evaluation baselines such as LLM-as-a-Judge, showing that our approach has higher agreement with human raters. This evaluation approach also enables smaller models like GPT-4.1-Mini to perform comparably to larger models such as GPT-5 at a 10x lower inference cost, which is important for deploying accessible and scalable education technologies.
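
Agreement with human raters is summarized above with $\kappa$. The summary does not say which variant is meant, so the sketch below assumes Cohen's kappa for two raters, which corrects the raw agreement rate for agreement expected by chance. The function name `cohens_kappa` and the pass/fail verdicts are hypothetical illustrations, not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each labels independently according to their own label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; kappa is conventionally 1.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts (1 = diagram passes, 0 = diagram fails) for ten diagrams.
human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
pipeline = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

kappa = cohens_kappa(human, pipeline)
print(round(kappa, 2))  # observed 0.8, chance 0.5, so kappa is 0.6
```

In this toy case the raters agree on 8 of 10 items, but half of that agreement is expected by chance, which is why $\kappa$ is lower than the raw agreement rate.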

Paper Structure

This paper contains 20 sections, 2 figures, 12 tables, 6 algorithms.

Figures (2)

  • Figure 1: Different evaluation approaches for generated TikZ code. The left shows TikZ code as the common input. Top: We asked human evaluators to rate the compiled TikZ images based on the rubric discussed in Section \ref{sec:methods_rubric}. Middle: LLM-as-a-Judge uses either the TikZ code, the rendered image, or both to make judgments. Bottom: In our back-translation method, an LLM translates the TikZ code into an intermediate representation (IR) with multiple fields, after which automatic checks (e.g., whether the diagram is fully in canvas or whether outlines are closed) are run. A diagram is considered valid if all checks pass.
  • Figure 2: Examples of abbreviated intermediate representations (IRs) extracted from TikZ diagrams. The triangle (left) and rectangular prism (right) illustrate how diagrams are mapped into structured IRs of shapes, line segments, nodes, and symbols. For clarity, only key fields are shown here; the full IR schema and detailed descriptions of all attributes are provided in the section below.
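
The two captions above describe an IR of shapes with automatic checks run over it, and a diagram counts as valid only if every check passes. The following is a minimal sketch of that idea, not the paper's schema or implementation: the class names `Shape` and `DiagramIR`, the fields `vertices`, `closed`, and `canvas`, and both check functions are assumptions made for illustration. In the described pipeline the IR would be produced by an LLM from TikZ code; here it is written by hand.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class Shape:
    kind: str               # e.g. "triangle" (hypothetical field)
    vertices: List[Point]   # outline vertices in drawing order
    closed: bool            # whether the outline returns to its start


@dataclass
class DiagramIR:
    # Canvas bounds as (xmin, ymin, xmax, ymax); a hypothetical field.
    canvas: Tuple[float, float, float, float]
    shapes: List[Shape] = field(default_factory=list)


def check_in_canvas(ir: DiagramIR) -> bool:
    """Every vertex of every shape lies inside the canvas bounds."""
    xmin, ymin, xmax, ymax = ir.canvas
    return all(
        xmin <= x <= xmax and ymin <= y <= ymax
        for shape in ir.shapes
        for x, y in shape.vertices
    )


def check_outlines_closed(ir: DiagramIR) -> bool:
    """Every shape outline is marked as closed."""
    return all(shape.closed for shape in ir.shapes)


# Deterministic, rule-based checks keyed by name so results are auditable.
CHECKS: Dict[str, Callable[[DiagramIR], bool]] = {
    "in_canvas": check_in_canvas,
    "outlines_closed": check_outlines_closed,
}


def evaluate(ir: DiagramIR) -> Tuple[bool, Dict[str, bool]]:
    """Run all checks; the diagram is valid only if every check passes."""
    results = {name: check(ir) for name, check in CHECKS.items()}
    return all(results.values()), results


# Hand-written IRs standing in for LLM output.
good = DiagramIR(
    canvas=(-1.0, -1.0, 5.0, 4.0),
    shapes=[Shape("triangle", [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)], closed=True)],
)
bad = DiagramIR(
    canvas=(-1.0, -1.0, 5.0, 4.0),
    shapes=[Shape("triangle", [(0.0, 0.0), (6.0, 0.0), (0.0, 3.0)], closed=False)],
)

print(evaluate(good))
print(evaluate(bad))
```

Keeping each check as a small named function means a failed diagram reports which rule it violated, which is one way to get the auditability the summary attributes to rule-based evaluation.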