MermaidSeqBench: An Evaluation Benchmark for LLM-to-Mermaid Sequence Diagram Generation

Basel Shbita, Farhan Ahmed, Chad DeLuca

TL;DR

This paper introduces MermaidSeqBench, a benchmark for evaluating how well LLMs generate Mermaid sequence diagrams from natural language prompts. The benchmark combines a human-verified seed set with synthetic expansion via Scalable Synthetic Data Generation and deterministic rule-based variations, yielding 132 NL-Mermaid pairs. An LLM-as-a-judge framework scores outputs on fine-grained dimensions, including syntax, activation handling, error/status tracking, and completeness, across six models and two judges. Results reveal significant gaps across model families and highlight the value of multi-judge evaluation. The dataset is open-sourced to encourage further research and extension to related diagram representations such as PlantUML.

Abstract

Large language models (LLMs) have demonstrated excellent capabilities in generating structured diagrams from natural language descriptions. In particular, they have shown great promise in generating sequence diagrams for software engineering, typically represented in a text-based syntax such as Mermaid. However, systematic evaluation in this space remains underdeveloped because benchmarks for assessing an LLM's correctness on this task are lacking. To address this shortcoming, we introduce MermaidSeqBench, a human-verified and LLM-synthetically-extended benchmark for assessing an LLM's capabilities in generating Mermaid sequence diagrams from textual prompts. The benchmark consists of a core set of 132 samples, starting from a small set of manually crafted and verified flows. These were expanded via a hybrid methodology combining human annotation, in-context LLM prompting, and rule-based variation generation. Our benchmark uses an LLM-as-a-judge model to assess Mermaid sequence diagram generation across fine-grained metrics, including syntax correctness, activation handling, error handling, and practical usability. We perform initial evaluations on several state-of-the-art LLMs and use multiple LLM judge models to demonstrate the effectiveness and flexibility of our benchmark. Our results reveal significant capability gaps across models and evaluation modes. Our proposed benchmark provides a foundation for advancing research in structured diagram generation and for developing more rigorous, fine-grained evaluation methodologies.
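
To make the multi-judge, per-metric evaluation concrete, the following is a minimal sketch of how scores from several judges might be combined for one generated diagram. The dimension names follow the abstract. The 0 to 1 score scale, the judge names, the `aggregate_judges` helper, and all numeric values are illustrative assumptions, not the paper's implementation or results.

```python
from statistics import mean

# Dimension names follow the abstract; the 0-1 scale is an assumption.
DIMENSIONS = (
    "syntax",
    "activation_handling",
    "error_handling",
    "practical_usability",
)


def aggregate_judges(scores_by_judge):
    """Average each dimension across judges and record the judge disagreement.

    scores_by_judge maps a judge name to a {dimension: score} dict.
    'spread' is the gap between the highest and lowest judge score.
    """
    summary = {}
    for dim in DIMENSIONS:
        values = [scores[dim] for scores in scores_by_judge.values()]
        summary[dim] = {
            "mean": mean(values),
            "spread": max(values) - min(values),
        }
    return summary


# Placeholder scores for illustration only; these are NOT results from the paper.
example = {
    "judge_a": {"syntax": 1.0, "activation_handling": 0.5,
                "error_handling": 0.75, "practical_usability": 0.5},
    "judge_b": {"syntax": 1.0, "activation_handling": 0.25,
                "error_handling": 0.75, "practical_usability": 1.0},
}
summary = aggregate_judges(example)
```

Reporting the spread next to the mean is one simple way to surface the cases where judges disagree, which is the situation in which a second judge adds information.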

Paper Structure

This paper contains 13 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: A UML sequence diagram from our benchmark, illustrating the "Uploading Documents with Secure Storage" flow. Participants include the User, Mobile App, Backend For Frontend (BFF), Azure AD, Database, and Azure Blob Storage. In this scenario, the User uploads a document through the Mobile App, which forwards the file and session token to the BFF. The BFF validates the token with Azure AD, checks the user's permissions, and, if authorized, records document metadata in the Database and securely stores the file in cloud storage (Azure Blob Storage). A confirmation is then returned to the app, while alternate paths handle errors for unauthorized access or oversized files.
  • Figure 2: A UML sequence diagram from our benchmark, illustrating the "Chatbot Interaction for Customer Support" flow. Participants include the User, Mobile App, Backend For Frontend, Chatbot, and Customer Support Agent. In this scenario, the User submits a query through the Mobile App, which forwards it via the BFF to the Chatbot. The Chatbot provides initial responses and may request clarifications; if unable to resolve the issue, it escalates the conversation to a Customer Support Agent. The Agent then interacts with the User through the Mobile App to provide a resolution, after which the app collects feedback from the User.
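
As a rough illustration of the kind of target output the benchmark pairs with a prompt, the sketch below writes the Figure 1 flow as Mermaid sequence-diagram text and runs two simple structural checks on it. The diagram is a hand-written approximation based only on the caption: the aliases, message labels, and both helper functions are assumptions, and the checks are a rule-based stand-in rather than the paper's LLM-as-a-judge evaluation.

```python
import re

# Hand-written approximation of the Figure 1 flow. Aliases and message labels
# are assumptions; this is not the benchmark's reference diagram.
DIAGRAM = """sequenceDiagram
    actor User
    participant App as Mobile App
    participant BFF as Backend For Frontend
    participant AAD as Azure AD
    participant DB as Database
    participant Blob as Azure Blob Storage
    User->>App: Upload document
    App->>BFF: Send file and session token
    BFF->>AAD: Validate token
    AAD-->>BFF: Token result
    BFF->>BFF: Check user permissions
    alt Authorized and file within size limit
        BFF->>DB: Record document metadata
        BFF->>Blob: Store file
        BFF-->>App: Upload confirmation
    else Unauthorized
        BFF-->>App: Reject as unauthorized
    else File too large
        BFF-->>App: Reject as oversized
    end
    App-->>User: Show result
"""


def declared_participants(diagram):
    """Return the aliases declared with 'participant' or 'actor', in order."""
    pattern = re.compile(r"^\s*(?:participant|actor)\s+(\w+)", re.MULTILINE)
    return pattern.findall(diagram)


def blocks_balanced(diagram):
    """Check that each alt/opt/loop/par block is closed by an 'end'.

    Only a subset of Mermaid block keywords is covered here.
    """
    depth = 0
    for line in diagram.splitlines():
        word = line.strip().split(" ", 1)[0]
        if word in {"alt", "opt", "loop", "par"}:
            depth += 1
        elif word == "end":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```

Checks like these only catch structural slips such as an unclosed `alt` block. Whether the diagram faithfully captures the described flow is the kind of question the benchmark delegates to its judge models.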