Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence

Bhavik Agarwal, Ishan Joshi, Viktoria Rojkova

TL;DR

The paper addresses the challenge of enforcing strict schema adherence in LLM-generated outputs within regulated biomanufacturing. It introduces ThinkJSON, a reasoning-driven reinforcement learning and fine-tuning pipeline that combines synthetic data generation, GRPO-based training, and supervised fine-tuning to enforce exact JSON schemas on a compact 1.5B-parameter model. The approach uses a JSON-based reward and a format-verification reward to directly optimize structure and formatting, achieving strong schema fidelity with modest compute. Empirical results show ThinkJSON outperforming larger DeepSeek R1 baselines and similar models in mean field accuracy and cleanliness, underscoring its practical potential for compliant, scalable text generation.

Abstract

In this paper, we address the challenge of enforcing strict schema adherence in large language model (LLM) generation by leveraging LLM reasoning capabilities. Building on the DeepSeek R1 reinforcement learning framework, our approach trains structured reasoning skills in a 1.5B-parameter model through a novel pipeline that combines synthetic reasoning dataset construction with custom reward functions under Group Relative Policy Optimization (GRPO). Specifically, we first perform R1 reinforcement learning on a 20K-sample unstructured-to-structured dataset, mirroring the original DeepSeek R1 methods, to establish core reasoning abilities. Subsequently, we perform supervised fine-tuning (SFT) on a separate 10K-sample reasoning dataset, focusing on refining schema adherence for downstream tasks. Despite the relatively modest training scope, requiring approximately 20 hours on an 8xH100 GPU cluster for GRPO training and 3 hours on 1xA100 for SFT, our model demonstrates robust performance in enforcing schema consistency. We compare our ThinkJSON approach against the original DeepSeek R1 (671B), distilled versions of DeepSeek R1 (Qwen-1.5B and Qwen-7B), and Gemini 2.0 Flash (70B), showcasing its effectiveness in real-world applications. Our results underscore the practical utility of a resource-efficient framework for schema-constrained text generation.
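
The abstract names two ingredients without showing them: custom reward functions and GRPO's group-relative scoring. The following is a minimal sketch of how such pieces could fit together, not the authors' implementation. The `<think>`/`<answer>` tag layout, the field-matching score, the penalty for extra keys, and all function names are assumptions made for illustration; only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
import json
import re
from statistics import mean, pstdev

# Assumed output layout: reasoning in <think>, final JSON in <answer>.
THINK_ANSWER = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed tag layout, else 0.0."""
    return 1.0 if THINK_ANSWER.match(completion.strip()) else 0.0


def json_reward(completion: str, expected: dict) -> float:
    """Hypothetical JSON-based reward: share of expected fields reproduced
    exactly, minus a penalty for keys that the schema does not contain."""
    found = THINK_ANSWER.match(completion.strip())
    if not found:
        return 0.0
    try:
        predicted = json.loads(found.group(1))
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(predicted, dict) or not expected:
        return 0.0
    matched = sum(
        1 for key, value in expected.items()
        if key in predicted and predicted[key] == value
    )
    extra = len(set(predicted) - set(expected))
    return max(0.0, (matched - extra) / len(expected))


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    expected = {"batch_id": "B-17", "ph": 7.2}
    group = [
        '<think>map fields</think><answer>{"batch_id": "B-17", "ph": 7.2}</answer>',
        '<think>partial</think><answer>{"batch_id": "B-17"}</answer>',
        '{"batch_id": "B-17", "ph": 7.2}',  # valid JSON, but missing the tags
    ]
    rewards = [format_reward(c) + json_reward(c, expected) for c in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Summing the two rewards is one possible design choice; a weighted combination would work equally well in this sketch. Because the advantages are computed within a group of completions for the same prompt, a completion is rewarded for being better than its siblings rather than for reaching an absolute score.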

Paper Structure

This paper contains 16 sections, 3 figures, 3 algorithms.

Figures (3)

  • Figure 2: GRPO Training Metrics
  • Figure 3: SFT Training Metrics
  • Figure 4: Performance Comparison