Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

Xiangru Tang; Yiming Zong; Jason Phang; Yilun Zhao; Wangchunshu Zhou; Arman Cohan; Mark Gerstein

TL;DR

This work investigates whether large language models can reliably generate complex structured tabular data and introduces Struc-Bench, a benchmark covering text, HTML, and LaTeX tables. It introduces FormatCoT to craft format-specific instructions and two evaluation metrics, P-Score and H-Score, to rigorously assess content and format fidelity. The authors show that a structure-aware fine-tuning approach on LLaMA-7B yields substantial gains, often surpassing larger models across many metrics, supported by human evaluation and an error-analysis framework. The study lays groundwork for serious evaluation and improvement of structured-output generation in LLMs, with practical implications for automated reporting, data pipelines, and agent-driven workflows.

Abstract

Despite the remarkable capabilities of Large Language Models (LLMs) like GPT-4, producing complex, structured tabular data remains challenging. Our study assesses LLMs' proficiency in structuring tables and introduces a novel fine-tuning method, cognizant of data structures, to bolster their performance. We unveil Struc-Bench, a comprehensive benchmark featuring prominent LLMs (GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna), which spans text tables, HTML, and LaTeX formats. Our proposed FormatCoT aids in crafting format-specific instructions from the intended outputs to populate this benchmark. Addressing the gap in task-centered evaluation, we propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score), to more accurately gauge LLM performance. Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains, outshining its LLM counterparts across most measures. In-depth error analysis and creating an ability map across six dimensions -- coverage, formatting, reasoning, comprehension, pragmatics, and hallucination -- highlight areas for future enhancements and suggest forthcoming research trajectories. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.

Paper Structure (48 sections, 13 figures, 5 tables)

Introduction
Problem Analysis and Benchmark
Problem Definition and Motivation
Problem Analysis
Benchmark Construction
Methodology
Data Generation
Instruction Tuning
Evaluation Metrics
P-Score
H-Score
Experiments
Basic Settings
Human Evaluation
Results
...and 33 more sections

Figures (13)

Figure 1: Overview of our workflow: We began by creating datasets of raw text tables, HTML tables, and LaTeX tables. LLaMA-7B was then trained on the training data constructed by FormatCoT. Finally, our benchmark evaluates how well current LLMs generate such tables.
Figure 2: Error analysis by human annotation. Some error types are explained in the appendix.
Figure 3: The upper-left corner box represents the original input, which notably lacks a description of the format. To explicitly instruct the model on format understanding, we employ the FormatCoT located on the right, which produces the <FORMAT INSTRUCTION>. The lower-left box illustrates what the input for LLaMA fine-tuning looks like after passing through FormatCoT. <TEXT> provides a descriptive text for the expected table output (original input), <TABLE> serves as a reference table (output), and the <FORMAT INSTRUCTION> is a format guideline generated through FormatCoT (added into input). Detailed prompts are displayed in the appendix.
Figure 4: An exemplification of our benchmark. The input is made up of the instruction and the input text, whereas the output aims to present the target table. Notably, there are some inaccuracies in the predicted output; for instance, 'Points in 4th quarter' under 'Hawks' should be vacant, and correspondingly, 'Points in 4th quarter' for 'Magic' should be 21.
Figure 5: Visualization of LLMs' capability.
...and 8 more figures
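
The Figure 3 caption describes a fine-tuning sample made of three parts: <TEXT> and <FORMAT INSTRUCTION> on the input side and <TABLE> as the reference output. The sketch below shows one way such a sample might be assembled. It is an illustration only: the `TrainingExample` class, the `build_example` helper, the delimiter layout, the ordering of the parts, and the sample strings are all assumptions, not the authors' actual prompt format (which the caption says is given in the appendix).

```python
from dataclasses import dataclass


@dataclass
class TrainingExample:
    """One fine-tuning pair: the model input and the reference output."""
    prompt: str
    target: str


def build_example(text: str, format_instruction: str, table: str) -> TrainingExample:
    # Hypothetical layout: the caption names the three parts, but not the
    # exact delimiters or their order, so this arrangement is an assumption.
    prompt = (
        f"<TEXT>\n{text.strip()}\n\n"
        f"<FORMAT INSTRUCTION>\n{format_instruction.strip()}"
    )
    # The reference table is the training target and is kept out of the input.
    return TrainingExample(prompt=prompt, target=table.strip())


# Illustrative strings only; they are not taken from the benchmark data.
example = build_example(
    text="The Magic scored 21 points in the fourth quarter.",
    format_instruction="Use '|' to separate cells; the first row holds the column headers.",
    table="| Team | Points in 4th quarter |\n| Magic | 21 |",
)
print(example.prompt)
print(example.target)
```

Keeping the reference table out of the prompt mirrors the caption's split between what is "added into input" and what serves as the output.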
