Learning to Generate Structured Output with Schema Reinforcement Learning

Yaxi Lu, Haolun Li, Xin Cong, Zhong Zhang, Yesai Wu, Yankai Lin, Zhiyuan Liu, Fangming Liu, Maosong Sun

TL;DR

This work systematically evaluates large language models’ ability to generate valid JSON under complex JSON schemas. It introduces SchemaBench, a ~40K-schema benchmark covering schema-only generation, schema-constrained reasoning, and escaping, and finds current models still struggle on complex schemas. To address this, it proposes Schema Reinforcement Learning (SRL) with a fine-grained schema validator and Thoughts of Structure (ToS), enabling online RL with three phases (Sampling, Rewarding, Updating) and achieving up to a 16% gain in valid JSON generation. The approach also demonstrates that improvements in structured generation translate to better downstream performance, including tool-calling in BFCL, while preserving general reasoning capabilities. Overall, SchemaBench and SRL offer a rigorous pathway to align LLM outputs with predefined data schemas in real-world pipelines.

Abstract

This study investigates the structured generation capabilities of large language models (LLMs), focusing on producing valid JSON outputs against a given schema. Despite the widespread use of JSON in integrating language models with programs, there is a lack of comprehensive analysis and benchmarking of these capabilities. We explore various aspects of JSON generation, such as structure understanding, escaping, and natural language description, to determine how to assess and enable LLMs to generate valid responses. Building upon this, we propose SchemaBench, which features around 40K different JSON schemas, to obtain and assess models' abilities in generating valid JSON. We find that the latest LLMs still struggle to generate a valid JSON string. Moreover, we demonstrate that incorporating reinforcement learning with a Fine-grained Schema Validator can further enhance models' understanding of JSON schema, leading to improved performance. Our models demonstrate significant improvement in both generating JSON outputs and downstream tasks.
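To make the idea of a fine-grained validation signal concrete, the following is a minimal sketch, not the paper's validator. It assumes a reward equal to the fraction of schema checks that pass, supports only a small subset of JSON Schema keywords (`type`, `required`, `properties`, `items`, `minLength`), and assigns zero reward to unparseable output. The names `count_checks` and `fine_grained_reward` are hypothetical.

```python
import json

# Map JSON Schema type names to Python types (small illustrative subset).
_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def count_checks(instance, schema, path="$"):
    """Return (passed, total, errors) for a tiny subset of JSON Schema."""
    passed, total, errors = 0, 0, []

    expected = schema.get("type")
    if expected is not None:
        total += 1
        # bool is a subclass of int in Python, so reject it for numeric types.
        is_numeric_bool = expected in ("integer", "number") and isinstance(instance, bool)
        if isinstance(instance, _TYPES[expected]) and not is_numeric_bool:
            passed += 1
        else:
            errors.append(f"{path}: expected type {expected}")
            return passed, total, errors  # deeper checks are meaningless here

    if isinstance(instance, str) and "minLength" in schema:
        total += 1
        if len(instance) >= schema["minLength"]:
            passed += 1
        else:
            errors.append(f"{path}: shorter than minLength {schema['minLength']}")

    if isinstance(instance, dict):
        for key in schema.get("required", []):
            total += 1
            if key in instance:
                passed += 1
            else:
                errors.append(f"{path}: missing required key '{key}'")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in instance:
                p, t, e = count_checks(instance[key], sub_schema, f"{path}.{key}")
                passed, total, errors = passed + p, total + t, errors + e

    if isinstance(instance, list) and "items" in schema:
        for i, item in enumerate(instance):
            p, t, e = count_checks(item, schema["items"], f"{path}[{i}]")
            passed, total, errors = passed + p, total + t, errors + e

    return passed, total, errors


def fine_grained_reward(text, schema):
    """Assumed reward: 0.0 if the text is not JSON, else the share of passed checks."""
    try:
        instance = json.loads(text)
    except json.JSONDecodeError:
        return 0.0
    passed, total, _ = count_checks(instance, schema)
    return passed / total if total else 1.0


schema = {
    "type": "object",
    "required": ["name", "age"],
    "properties": {
        "name": {"type": "string", "minLength": 3},
        "age": {"type": "integer"},
    },
}
print(fine_grained_reward('{"name": "Alice", "age": 30}', schema))
print(fine_grained_reward('{"name": "Al", "age": 30}', schema))
print(fine_grained_reward('{"name": "Al"', schema))
```

Compared with a binary valid/invalid verifier, a graded score of this kind distinguishes an output that misses one constraint from one that cannot be parsed at all, which is the intuition behind using a finer-grained signal during RL.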

Paper Structure

This paper contains 25 sections, 2 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of the data curation pipeline. We conduct multi-stage cleaning to obtain valid JSON schemas. The pie chart on the top right shows the data type distribution of the collected schemas. The top three data types are string, object, and array. The error cases in the left corner show possible errors models could make when generating JSON strings according to the given schema.
  • Figure 2: Top: snippets for the three sub-tasks in Schema-only Generation. The last two snippets are special fields inserted into basic schemas like the first snippet. Bottom: corresponding common failure cases for the three sub-tasks. The first violates the minLength requirement, the second gives an incorrect base64 string, and the third gives the wrong number of backslashes, causing an escape error.
  • Figure 3: Statistics of failure cases for four models, calculated on a subset of SchemaBench. All models except GPT-4o still exhibit a relatively high rate of JSON parsing errors, indicating their lack of robustness in JSON generation.
  • Figure 4: Reinforcement training accuracy on complex schema subset for LLaMA-3.2 3B. The red line is the fine-tuning baseline.
  • Figure 5: Ablation study results for LLaMA-3.2 3B. For each line, we train the model by adding one component to the ordinary RL pipeline with an outcome verifier. All results are reported with RL after 10K samples.
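The three failure types named in the Figure 2 caption (a minLength violation, an invalid base64 string, and a backslash escaping error) can each be detected with the Python standard library. The sketch below is illustrative only; the helper names and sample strings are assumptions and are not taken from SchemaBench.

```python
import base64
import binascii
import json


def meets_min_length(value, min_length):
    """Check the JSON Schema minLength constraint on a string."""
    return isinstance(value, str) and len(value) >= min_length


def is_valid_base64(value):
    """Return True if the string decodes as strict standard base64."""
    try:
        base64.b64decode(value, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


def parses_as_json(text):
    """Return True if the text is well-formed JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# 1. minLength: a two-character string fails a minLength of 3.
print(meets_min_length("ab", 3))

# 2. base64: dropping the padding character makes the string invalid.
print(is_valid_base64("aGVsbG8="), is_valid_base64("aGVsbG8"))

# 3. Escaping: a single backslash before 'U' is not a legal JSON escape,
#    while a doubled backslash encodes one literal backslash.
bad = r'{"path": "C:\Users"}'
good = r'{"path": "C:\\Users"}'
print(parses_as_json(bad), parses_as_json(good))
```

The escaping case is the least forgiving of the three: one missing backslash makes the whole output unparseable, whereas the other two failures still yield JSON that a schema validator can inspect.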