DiagramEval: Evaluating LLM-Generated Diagrams via Graphs

Chumeng Liang, Jiaxuan You

TL;DR

The paper tackles the challenge of evaluating LLM-generated diagrams, whose complex structure is poorly captured by traditional image-based metrics. It proposes DiagramEval, a graph-based evaluation framework that represents diagrams as text-attributed graphs and introduces Node Alignment and Path Alignment as fine-grained, explainable metrics computed from the generated-diagram graph $G_{gen}$ and the reference graph $G_{ref}$. The authors implement an automated SVG-to-graph extraction pipeline and benchmark against CLIP-based baselines on a dataset derived from CVPR 2025 papers, showing partial alignment with human judgments and offering insights into diagram quality beyond global similarity. This work provides a reusable benchmark and methodology for objective, interpretable assessment of scientific diagrams, with practical implications for improving LLM-driven diagram generation.

Abstract

Diagrams play a central role in research papers for conveying ideas, yet they are often notoriously complex and labor-intensive to create. Although diagrams are presented as images, standard image generative models struggle to produce clear diagrams with well-defined structure. We argue that a promising direction is to generate demonstration diagrams directly in textual form as SVGs, which can leverage recent advances in large language models (LLMs). However, due to the complexity of components and the multimodal nature of diagrams, sufficiently discriminative and explainable metrics for evaluating the quality of LLM-generated diagrams remain lacking. In this paper, we propose DiagramEval, a novel evaluation metric designed to assess demonstration diagrams generated by LLMs. Specifically, DiagramEval conceptualizes diagrams as graphs, treating text elements as nodes and their connections as directed edges, and evaluates diagram quality using two new groups of metrics: node alignment and path alignment. For the first time, we effectively evaluate diagrams produced by state-of-the-art LLMs on recent research literature, quantitatively demonstrating the validity of our metrics. Furthermore, we show how the enhanced explainability of our proposed metrics offers valuable insights into the characteristics of LLM-generated diagrams. Code: https://github.com/ulab-uiuc/diagram-eval.
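
To make the two metric groups concrete, the following is a minimal sketch of how node and path alignment could be computed for a pair of diagram graphs. It is an illustration under stated assumptions, not the authors' implementation: the dictionary graph format, the function names (normalize, f1_scores, reachable_pairs, diagram_scores), the exact-match rule on normalized text, and the use of reachability between matched nodes as the notion of a "path" are all hypothetical choices made here. The released code may match nodes and paths differently.

```python
from collections import defaultdict, deque


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivially different labels compare equal."""
    return " ".join(label.lower().split())


def f1_scores(n_common: int, n_gen: int, n_ref: int) -> dict:
    """Precision, recall and F1 from a count of common items and the two set sizes."""
    precision = n_common / n_gen if n_gen else 0.0
    recall = n_common / n_ref if n_ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def reachable_pairs(edges, keep):
    """Ordered pairs (u, v) of kept nodes where v is reachable from u along directed edges."""
    adj = defaultdict(set)
    for src, dst in edges:
        adj[normalize(src)].add(normalize(dst))
    pairs = set()
    for start in keep:
        # Breadth-first search from each kept node.
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        pairs.update((start, v) for v in seen if v in keep and v != start)
    return pairs


def diagram_scores(gen: dict, ref: dict) -> dict:
    """Node and path alignment for graphs given as {"nodes": [...], "edges": [(src, dst), ...]}."""
    gen_nodes = {normalize(n) for n in gen["nodes"]}
    ref_nodes = {normalize(n) for n in ref["nodes"]}
    # Assumption: nodes match only when their normalized text is identical.
    matched = gen_nodes & ref_nodes
    node = f1_scores(len(matched), len(gen_nodes), len(ref_nodes))
    # Assumption: a "path" is a reachable ordered pair of matched nodes.
    gen_paths = reachable_pairs(gen["edges"], matched)
    ref_paths = reachable_pairs(ref["edges"], matched)
    path = f1_scores(len(gen_paths & ref_paths), len(gen_paths), len(ref_paths))
    return {"node": node, "path": path}


# Toy example: the generated diagram replaces "Output" with "Loss".
ref = {
    "nodes": ["Input Image", "Encoder", "Decoder", "Output"],
    "edges": [("Input Image", "Encoder"), ("Encoder", "Decoder"), ("Decoder", "Output")],
}
gen = {
    "nodes": ["input image", "Encoder", "Decoder", "Loss"],
    "edges": [("input image", "Encoder"), ("Encoder", "Decoder"), ("Decoder", "Loss")],
}
scores = diagram_scores(gen, ref)
print(scores)
```

In this toy case three of the four nodes match on each side, so node precision, recall and F1 are all 0.75, while the connections among the matched nodes are identical, so path F1 is 1. Separating the two scores is what lets such a metric report that a diagram has the right flow but a missing or mislabeled component.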

Paper Structure

This paper contains 26 sections, 1 equation, 6 figures, 2 tables.

Figures (6)

  • Figure 1: DiagramEval framework overview. By considering both research paper context and paper diagrams as directed graphs, we use information precision and recall among nodes and edges of the generated diagram graph and those of the reference graph (from reference diagrams or paper context) to measure the generation quality of paper diagrams.
  • Figure 2: The detailed pipeline of the DiagramEval framework. Intuitively, Node Alignment measures the correctly matched text elements between generated and ground-truth diagrams, while Path Alignment measures the correctly matched connections among the matched elements.
  • Figure 3: Statistical results: probability density functions (PDFs) of our 6 novel metrics and 2 CLIPScore metrics.
  • Figure 4: Correlation map of our 6 novel metrics and 2 CLIPScore metrics. The Node Alignment metrics show considerable positive correlation with the 2 CLIPScore metrics. The Path Alignment metrics appear largely uncorrelated with the 2 CLIPScore metrics.
  • Figure 5: Case: Low CLIPScore (Text) and high Path F1. CLIPScore (Text): 0.2558. Path F1: 1.
  • ...and 1 more figures