DeepJSONEval: Benchmarking Complex Nested JSON Data Mining for Large Language Models

Zhicheng Zhou, Jing Li, Suming Qiu, Junjie Huang, Linyuan Qiu, Zhijie Sun

TL;DR

DeepJSONEval addresses a gap in evaluating how well LLMs comprehend web data and extract it into deeply nested JSON structures. It introduces a four-stage benchmark construction workflow (web-text aggregation, schema-tree construction, beam search for constrained subtrees with real-time path-value updating, and ground-truth schema generation) that produces 2100 multi-domain instances with 3–7 levels of nesting. The evaluation framework measures syntax, hierarchical key matching, and strict correctness, supported by thorough ground-truthing and human-in-the-loop QA. Experiments across 12 leading LLMs reveal significant performance gaps that widen with nesting depth. External validity is demonstrated with a small end-to-end web pipeline whose results correlate strongly with benchmark scores, supporting the benchmark's practical utility. The work provides an open-source, scalable platform for robust, domain-diverse assessment of JSON extraction and sets the stage for future work on deeper nesting and dynamic schema adaptation.
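The three evaluation aspects named above (syntax, hierarchical key matching, strict correctness) can be illustrated with a minimal sketch. It is not the benchmark's actual scoring code: the function names `key_paths` and `score`, the dotted-path convention, and the way key overlap is computed are all assumptions made for illustration.

```python
import json


def key_paths(node, prefix=""):
    """Collect the set of dotted key paths found in a parsed JSON value."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(value, path)
    elif isinstance(node, list):
        # Array items share one path segment, marked with "[]".
        for item in node:
            paths |= key_paths(item, prefix + "[]")
    return paths


def score(output_text, truth):
    """Score a model's raw text output against a ground-truth object.

    Hypothetical metrics (assumed, not taken from the paper):
      syntax    - 1.0 if the output parses as JSON, else 0.0
      key_match - fraction of ground-truth key paths present in the output
      strict    - 1.0 only if the parsed output equals the ground truth
    """
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return {"syntax": 0.0, "key_match": 0.0, "strict": 0.0}
    expected = key_paths(truth)
    found = key_paths(parsed)
    key_match = len(expected & found) / len(expected) if expected else 1.0
    strict = 1.0 if parsed == truth else 0.0
    return {"syntax": 1.0, "key_match": key_match, "strict": strict}
```

For instance, scoring the output `{"a": {"b": 1}}` against a ground truth that also contains a nested array under `a` would pass the syntax check, receive partial credit for key matching, and fail the strict check.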

Abstract

The internet is saturated with low-density, high-redundancy information, such as social media comments, repetitive news, and lengthy discussions, making it difficult to extract valuable insights efficiently. Multi-layer nested JSON structures provide an effective solution by compressing such information into semantically rich, hierarchical representations. These representations organize data into key-value pairs, arrays, and nested objects, preserving contextual relationships and enabling efficient storage, retrieval, and semantic querying. For instance, in news aggregation, a JSON object can nest an article's metadata (title, author, date), content (text, multimedia), and multimedia information (multimedia type, caption) hierarchically. Large Language Models (LLMs) play a transformative role in web data mining by parsing unstructured text and outputting structured results directly into complex JSON schemas. However, current benchmarks for evaluating LLMs' JSON output capabilities overemphasize pure JSON generation rather than assessing data comprehension and extraction abilities, which limits their relevance to practical web data mining tasks. To address this, we introduce DeepJSONEval, a novel benchmark featuring 2100 multi-domain instances with deep nested structures, categorized by difficulty. Experiments show significant performance gaps among LLMs in handling such complexity. Our benchmark and datasets are open-sourced to advance research in structured JSON generation (https://github.com/GTS-AI-Infra-Lab-SotaS/DeepJSONEval).
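The news-aggregation example in the abstract can be made concrete with a short sketch. The field names follow the abstract; the placeholder values, the `nesting_depth` helper, and the convention that only objects (not arrays) add a level are assumptions for illustration, and the benchmark may count depth differently.

```python
import json

# Illustrative article object; every value is a placeholder, not benchmark data.
article = {
    "metadata": {
        "title": "Example headline",
        "author": "Example author",
        "date": "2025-01-01",
    },
    "content": {
        "text": "Example body text.",
        "multimedia": [
            {"type": "image", "caption": "Example caption"},
        ],
    },
}


def nesting_depth(node):
    """Return how many object levels are nested in a parsed JSON value.

    Assumed convention: each dict adds one level, lists are transparent,
    and scalars have depth 0.
    """
    if isinstance(node, dict):
        return 1 + max((nesting_depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return max((nesting_depth(v) for v in node), default=0)
    return 0


print(json.dumps(article, indent=2))
print("depth:", nesting_depth(article))
```

Under this convention the article object has three levels: the root, `content`, and each entry of `multimedia`.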

Paper Structure

This paper contains 37 sections, 9 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: The representative application scenarios for multi-layer nested JSON with LLM.
  • Figure 2: The workflow of benchmark construction.
  • Figure 3: The distribution overview of DeepJSONEval across difficulty levels, domains, and categories.
  • Figure 4: The prompt length statistics for Medium and Hard samples in DeepJSONEval.
  • Figure 5: The response length distribution across evaluation scores for multiple LLMs.
  • ...and 4 more figures