DiagrammerGPT: Generating Open-Domain, Open-Platform Diagrams via LLM Planning
Abhay Zala, Han Lin, Jaemin Cho, Mohit Bansal
TL;DR
DiagrammerGPT tackles open-domain diagram generation by coupling LLM-based planning with a layout-grounded diagram generator. The two-stage pipeline first uses an LLM to produce detailed diagram plans, refined through a planner-auditor feedback loop; DiagramGLIGEN and a text label rendering module then turn those plans into diagrams with clearly rendered text labels. The authors introduce the AI2D-Caption dataset to benchmark text-to-diagram generation and demonstrate that DiagrammerGPT outperforms strong baselines on objective diagram-quality metrics and human judgments, while enabling vector-graphic exports and human-in-the-loop editing. This work highlights the potential of combining planning-focused LLMs with layout-grounded generation to improve diagrammatic information visualization and accessibility in education and research.
Abstract
Text-to-image (T2I) generation has seen significant growth over the past few years. Despite this, there has been little work on generating diagrams with T2I models. A diagram is a symbolic/schematic representation that explains information using structurally rich and spatially complex visualizations (e.g., a dense combination of related objects, text labels, directional arrows/lines, etc.). Existing state-of-the-art T2I models often fail at diagram generation because they lack fine-grained object layout control when many objects are densely connected via complex relations such as arrows/lines, and also often fail to render comprehensible text labels. To address this gap, we present DiagrammerGPT, a novel two-stage text-to-diagram generation framework leveraging the layout guidance capabilities of LLMs to generate more accurate diagrams. In the first stage, we use LLMs to generate and iteratively refine 'diagram plans' (in a planner-auditor feedback loop). In the second stage, we use a diagram generator, DiagramGLIGEN, and a text label rendering module to generate diagrams (with clear text labels) following the diagram plans. To benchmark the text-to-diagram generation task, we introduce AI2D-Caption, a densely annotated diagram dataset built on top of the AI2D dataset. We show that our DiagrammerGPT framework produces more accurate diagrams, outperforming existing T2I models. We also provide comprehensive analysis, including open-domain diagram generation, multi-platform vector graphic diagram generation, human-in-the-loop editing, and multimodal planner/auditor LLMs.
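To make the first-stage planner-auditor feedback loop concrete, the following is a minimal sketch, not the authors' implementation. It assumes a diagram plan can be reduced to named entities with normalized bounding boxes plus arrows between them. The names `DiagramPlan`, `audit`, `refine_plan`, and `toy_planner` are hypothetical, and the rule-based auditor and hard-coded planner stand in for the LLM calls the framework would actually make.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Assumed layout format: (x, y, width, height), normalized to the unit canvas.
Box = Tuple[float, float, float, float]


@dataclass
class DiagramPlan:
    """Hypothetical plan: entities with layouts, plus arrows between them."""
    entities: Dict[str, Box]
    arrows: List[Tuple[str, str]] = field(default_factory=list)


def audit(plan: DiagramPlan) -> List[str]:
    """Stand-in auditor: returns a list of textual issues (empty if none)."""
    issues = []
    for name, (x, y, w, h) in plan.entities.items():
        if x < 0 or y < 0 or x + w > 1 or y + h > 1:
            issues.append(f"entity '{name}' falls outside the canvas")
    for src, dst in plan.arrows:
        for endpoint in (src, dst):
            if endpoint not in plan.entities:
                issues.append(f"arrow endpoint '{endpoint}' is not an entity")
    return issues


Planner = Callable[[str, Optional[DiagramPlan], List[str]], DiagramPlan]


def refine_plan(prompt: str, planner: Planner,
                auditor: Callable[[DiagramPlan], List[str]] = audit,
                max_rounds: int = 3) -> DiagramPlan:
    """Draft a plan, then alternate auditing and re-planning until clean."""
    plan = planner(prompt, None, [])
    for _ in range(max_rounds):
        feedback = auditor(plan)
        if not feedback:
            break  # the auditor found nothing left to fix
        plan = planner(prompt, plan, feedback)
    return plan


def toy_planner(prompt: str, previous: Optional[DiagramPlan],
                feedback: List[str]) -> DiagramPlan:
    """Stand-in for an LLM planner: drafts a flawed plan, then repairs it."""
    if previous is None:
        # The first draft deliberately places 'earth' past the right edge.
        return DiagramPlan(
            entities={"sun": (0.125, 0.125, 0.25, 0.25),
                      "earth": (0.9, 0.5, 0.25, 0.25)},
            arrows=[("sun", "earth")],
        )
    # On feedback, clamp every box back inside the unit canvas.
    fixed = {
        name: (max(0.0, min(x, 1 - w)), max(0.0, min(y, 1 - h)), w, h)
        for name, (x, y, w, h) in previous.entities.items()
    }
    return DiagramPlan(entities=fixed, arrows=list(previous.arrows))


final_plan = refine_plan("the earth orbiting the sun", toy_planner)
```

In this sketch the loop stops as soon as the auditor reports no issues or after `max_rounds` revisions, whichever comes first. The refined plan would then be handed to the second stage (the diagram generator and the text label rendering module), which is not modeled here.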
