StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

Jialin Yang; Dongfu Jiang; Lipeng He; Sherman Siu; Yuxuan Zhang; Disen Liao; Zhuofeng Li; Huaye Zeng; Yiming Jia; Haozhe Wang; Benjamin Schneider; Chi Ruan; Wentao Ma; Zhiheng Lyu; Yifei Wang; Yi Lu; Quy Duc Do; Ziyan Jiang; Ping Nie; Wenhu Chen

TL;DR

StructEval presents a comprehensive benchmark for assessing LLMs' ability to generate and convert highly structured outputs across 18 formats and 44 task types, spanning both non-renderable and renderable outputs. It introduces two complementary subsets (StructEval-T and StructEval-V) with a unified evaluation framework based on Syntax Score, Keyword Matching Score, and VQA Score, underpinned by a three-stage annotation pipeline to ensure high-quality data. The experimental results show a persistent gap between state-of-the-art commercial models and open-source alternatives, with generation tasks generally harder than conversions and renderable outputs posing greater challenges. By enabling automated, cross-format assessment and highlighting hard subtasks, StructEval aims to drive progress in robust, format-faithful structured output generation for real-world applications.

Abstract

As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: 1) generation tasks, which produce structured output from natural language prompts, and 2) conversion tasks, which translate between structured formats. Our benchmark encompasses 18 formats and 44 task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models such as o1-mini achieve an average score of only 75.58, with open-source alternatives lagging approximately 10 points behind. We find that generation tasks are more challenging than conversion tasks, and that producing correct visual content is more difficult than generating text-only structures.
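To make the idea of format-adherence metrics concrete, here is a minimal sketch of how a Syntax Score and a Keyword Matching Score might be computed for a JSON output. This page does not give the benchmark's exact definitions, so everything below is an assumption for illustration: the function names `syntax_score` and `keyword_matching_score` are hypothetical, the syntax check is reduced to "does the text parse as JSON", and keyword matching is reduced to plain substring lookup.

```python
import json
from typing import Sequence


def syntax_score(output: str) -> float:
    """Hypothetical syntax check: 1.0 if the text parses as JSON, else 0.0."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


def keyword_matching_score(output: str, keywords: Sequence[str]) -> float:
    """Hypothetical keyword check: fraction of expected keywords found verbatim."""
    if not keywords:
        # No expectations were given, so there is nothing to match.
        return 0.0
    hits = sum(1 for keyword in keywords if keyword in output)
    return hits / len(keywords)


if __name__ == "__main__":
    # Illustrative model output and expected keys (not taken from the benchmark).
    model_output = '{"name": "Ada", "tags": ["math", "computing"]}'
    expected = ['"name"', '"tags"', '"email"']

    print("syntax:", syntax_score(model_output))
    print("keywords:", keyword_matching_score(model_output, expected))
```

A real evaluator would need a parser per format (YAML, CSV, HTML, and so on) and a structure-aware notion of a match; substring lookup is used here only to keep the sketch short.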
