Learning to Generate Structured Output with Schema Reinforcement Learning
Yaxi Lu, Haolun Li, Xin Cong, Zhong Zhang, Yesai Wu, Yankai Lin, Zhiyuan Liu, Fangming Liu, Maosong Sun
TL;DR
This work systematically evaluates large language models’ ability to generate valid JSON under complex JSON schemas. It introduces SchemaBench, a ~40K-schema benchmark covering schema-only generation, schema-constrained reasoning, and escaping, and finds current models still struggle on complex schemas. To address this, it proposes Schema Reinforcement Learning (SRL) with a fine-grained schema validator and Thoughts of Structure (ToS), enabling online RL with three phases (Sampling, Rewarding, Updating) and achieving up to a 16% gain in valid JSON generation. The approach also demonstrates that improvements in structured generation translate to better downstream performance, including tool-calling in BFCL, while preserving general reasoning capabilities. Overall, SchemaBench and SRL offer a rigorous pathway to align LLM outputs with predefined data schemas in real-world pipelines.
Abstract
This study investigates the structured generation capabilities of large language models (LLMs), focusing on producing valid JSON outputs against a given schema. Despite the widespread use of JSON in integrating language models with programs, there is a lack of comprehensive analysis and benchmarking of these capabilities. We explore various aspects of JSON generation, such as structure understanding, escaping, and natural language description, to determine how to assess and enable LLMs to generate valid responses. Building upon this, we propose SchemaBench, which features around 40K different JSON schemas, to elicit and assess models' abilities in generating valid JSON. We find that even the latest LLMs still struggle to generate valid JSON strings. Moreover, we demonstrate that incorporating reinforcement learning with a Fine-grained Schema Validator can further enhance models' understanding of JSON schemas, leading to improved performance. Our models demonstrate significant improvement in both generating JSON outputs and downstream tasks.
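To make the idea of a fine-grained validation signal concrete, the following is a minimal sketch of how a partial-credit reward could be computed for a generated JSON string. It is not the paper's Fine-grained Schema Validator, whose design is not described in this section. The function names `count_checks` and `schema_reward` are hypothetical, only a small subset of JSON Schema keywords (`type`, `required`, `properties`, `items`) is handled, and scoring by the fraction of passed checks is an assumption made for illustration.

```python
import json

# Map JSON Schema type names to Python types (illustrative subset only).
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "null": type(None),
}


def count_checks(schema, value):
    """Return (passed, total) checks for `value` against a tiny schema subset.

    Hypothetical helper: handles only `type`, `required`, `properties`, `items`.
    """
    passed, total = 0, 0
    expected = schema.get("type")
    if expected is not None:
        total += 1
        ok = isinstance(value, _TYPES[expected])
        # bool is a subclass of int in Python, but JSON keeps them distinct.
        if expected in ("number", "integer") and isinstance(value, bool):
            ok = False
        if not ok:
            return passed, total  # wrong type: nested checks are skipped
        passed += 1
    if isinstance(value, dict):
        # One check per required key.
        for key in schema.get("required", []):
            total += 1
            passed += int(key in value)
        # Recurse into declared properties that are present.
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                p, t = count_checks(sub_schema, value[key])
                passed, total = passed + p, total + t
    if isinstance(value, list) and "items" in schema:
        # Recurse into every array element.
        for item in value:
            p, t = count_checks(schema["items"], item)
            passed, total = passed + p, total + t
    return passed, total


def schema_reward(schema, text):
    """Assumed reward: 0.0 for unparseable JSON, else the fraction of passed checks."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return 0.0
    passed, total = count_checks(schema, value)
    return passed / total if total else 1.0
```

Under this sketch, an output that parses but gets one field's type wrong earns partial credit, while a string that is not valid JSON earns nothing. A graded signal of this kind is one plausible reason a fine-grained validator could be more useful for reinforcement learning than a binary valid/invalid check.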
