LLMStructBench: Benchmarking Large Language Model Structured Data Extraction

Sönke Tenckhoff, Mario Koddenbrock, Erik Rodner

TL;DR

LLMStructBench addresses the challenge of reliable extraction of structured data from natural language into JSON by providing an open, diverse dataset and a rigorous evaluation framework. The study benchmarks 22 open-weight LLMs across five use cases (995 samples) against ground-truth JSON schemas, using complementary token-level and document-level metrics. Key finding: prompting strategy often matters more than model size for structural validity and semantic correctness, with PJ+ giving robust parseability and P offering stronger value fidelity in some models. The results guide practitioners in selecting prompting strategies and open models for ETL-like tasks and highlight remaining bottlenecks in semantic accuracy.

Abstract

We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text. Our open dataset comprises diverse, manually verified parsing scenarios of varying complexity and enables systematic testing across 22 models and five prompting strategies. We further introduce complementary performance metrics that capture both token-level accuracy and document-level validity, facilitating rigorous comparison of model, size, and prompting effects on parsing reliability. In particular, we show that choosing the right prompting strategy is more important than standard attributes such as model size. The right strategy ensures structural validity, especially for smaller or less reliable models, but increases the number of semantic errors. Our benchmark suite is a step towards future research on LLMs applied to parsing or Extract, Transform, and Load (ETL) applications.

Paper Structure (20 sections, 8 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: Schematic of the LLMStructBench inference-time evaluation setup. Each test case provides a natural-language message, a corresponding JSON schema, and an example input-output pair as input to the LLM. The model's generated JSON object is then evaluated against the ground truth for both syntactic validity and semantic accuracy.
  • Figure 2: Parsing success rates for the Phi3-3.8b model across different prompting strategies. Bars represent the count of test cases resulting in a perfectly parsed JSON (green), a JSON successfully extracted from surrounding text (yellow), or a complete parsing failure (red).
  • Figure 3: Distribution of error types for the Phi3-3.8b model's JSON outputs, categorized by prompting strategy. Categories include perfect matches to the ground truth (green), semantic mistakes (orange), critical errors (gray), and complete parsing failures (red).
  • Figure 4: Performance breakdown by model size for the Gemma3 family using the P-prompting strategy. The stacked bars illustrate the counts of perfectly correct (green), mistaken (orange), erroneous (gray), and failed (red) JSON extractions across model variants.
  • Figure 5: Composite scores ($F1_{\text{micro}}$ and $DOC_{\text{micro}}$) for the Gemma3 model family, demonstrating performance trends across different model sizes under the P-prompting strategy.
  • ...and 6 more figures
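
The three parsing outcomes named in the Figure 2 caption (perfectly parsed, extracted from surrounding text, complete failure) can be illustrated with a small sketch. This is not the benchmark's own implementation: the function name `classify_output`, the category labels, and the fallback rule (scan for the first decodable `{...}` object) are assumptions made here for illustration, using only Python's standard `json` module.

```python
import json


def classify_output(raw: str) -> str:
    """Classify one model response as 'perfect', 'extracted', or 'failed'.

    Hypothetical labels mirroring the three outcomes in the Figure 2 caption;
    the benchmark's actual extraction rules may differ.
    """
    # Case 1: the whole response is a single valid JSON object.
    try:
        if isinstance(json.loads(raw), dict):
            return "perfect"
    except json.JSONDecodeError:
        pass

    # Case 2: a JSON object is embedded in surrounding text.
    # Assumption: try each '{' as a candidate start and accept the first
    # position from which a complete object can be decoded.
    decoder = json.JSONDecoder()
    for start, ch in enumerate(raw):
        if ch != "{":
            continue
        try:
            obj, _end = decoder.raw_decode(raw[start:])
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return "extracted"

    # Case 3: no decodable JSON object anywhere in the response.
    return "failed"


if __name__ == "__main__":
    samples = [
        '{"name": "Ada"}',
        'Sure! Here is the JSON: {"name": "Ada"} Hope it helps.',
        '{"name": ',
    ]
    for sample in samples:
        print(classify_output(sample), "<-", sample)
```

This sketch covers syntactic validity only. Checking the parsed object against the JSON schema and comparing its values with the ground truth, as in the Figure 3 categories, would be separate steps.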