MermaidSeqBench: An Evaluation Benchmark for LLM-to-Mermaid Sequence Diagram Generation

Basel Shbita; Farhan Ahmed; Chad DeLuca

TL;DR

This paper introduces MermaidSeqBench, a benchmark for evaluating how well LLMs generate Mermaid sequence diagrams from natural language prompts. The benchmark combines a human-verified seed with scalable synthetic expansion via Scalable Synthetic Data Generation and deterministic rule-based variations, yielding 132 NL-Mermaid pairs. An LLM-as-a-judge framework scores outputs on fine-grained dimensions, including syntax, activation handling, error/status tracking, and completeness, across six models and two judges. Results reveal significant gaps across model families and highlight the value of multi-judge evaluation. The dataset is open-sourced to spur further research and extension to related diagram representations such as PlantUML.

Abstract

Large language models (LLMs) have demonstrated excellent capabilities in generating structured diagrams from natural language descriptions. In particular, they have shown great promise in generating sequence diagrams for software engineering, typically represented in a text-based syntax such as Mermaid. However, systematic evaluation in this space remains underdeveloped because benchmarks for assessing LLM correctness on this task are lacking. To address this shortcoming, we introduce MermaidSeqBench, a human-verified and LLM-synthetically-extended benchmark for assessing an LLM's capabilities in generating Mermaid sequence diagrams from textual prompts. The benchmark consists of a core set of 132 samples, grown from a small set of manually crafted and verified flows. These flows were expanded via a hybrid methodology combining human annotation, in-context LLM prompting, and rule-based variation generation. Our benchmark uses an LLM-as-a-judge model to assess Mermaid sequence diagram generation across fine-grained metrics, including syntax correctness, activation handling, error handling, and practical usability. We perform initial evaluations on several state-of-the-art LLMs and use multiple LLM judge models to demonstrate the effectiveness and flexibility of our benchmark. Our results reveal significant capability gaps across models and evaluation modes. Our proposed benchmark provides a foundation for advancing research in structured diagram generation and for developing more rigorous, fine-grained evaluation methodologies.
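To make the "activation handling" metric concrete, the sketch below checks whether activations in a Mermaid sequence diagram are balanced. It is an illustration only and not the benchmark's evaluation method, which relies on LLM judges. The function name `activation_issues` and both regular expressions are assumptions introduced here, and the patterns cover only single-word participant names and the common arrow types.

```python
import re
from collections import Counter

# Explicit form: "activate Bob" / "deactivate Bob".
EXPLICIT = re.compile(r"^(activate|deactivate)\s+(\w+)$")
# Message form: "Alice->>+Bob: text". The optional +/- after the arrow is
# Mermaid's shorthand: "+" activates the receiver, "-" deactivates the sender.
MESSAGE = re.compile(r"^(\w+)\s*(--?(?:>>|>|x|\)))([+-]?)\s*(\w+)\s*:")


def activation_issues(diagram: str) -> list[str]:
    """Return human-readable problems with activation balance (empty if none)."""
    depth = Counter()  # current activation depth per participant
    issues = []
    for number, raw in enumerate(diagram.splitlines(), start=1):
        line = raw.strip()
        explicit = EXPLICIT.match(line)
        message = MESSAGE.match(line)
        if explicit:
            keyword, who = explicit.groups()
            change = 1 if keyword == "activate" else -1
        elif message and message.group(3):
            sender, _arrow, flag, receiver = message.groups()
            who, change = (receiver, 1) if flag == "+" else (sender, -1)
        else:
            continue  # not an activation-related line
        if depth[who] + change < 0:
            issues.append(f"line {number}: {who} deactivated while inactive")
            continue
        depth[who] += change
    issues.extend(f"{who} is never deactivated" for who, d in depth.items() if d > 0)
    return issues


balanced = """sequenceDiagram
    Alice->>+Bob: Request
    Bob-->>-Alice: Response"""

dangling = """sequenceDiagram
    Alice->>+Bob: Request
    Bob-->>Alice: Response"""

print(activation_issues(balanced))
print(activation_issues(dangling))
```

A deterministic check like this can catch only structural slips. Judging whether a diagram is complete or practically usable for a given prompt is the kind of assessment the paper delegates to LLM judges.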
