Large Language Models for JSON Schema Discovery
Michael J. Mior
TL;DR
The paper tackles the problem that JSON schema discovery often yields structurally correct but semantically weak schemas. It introduces a three-pronged approach powered by large language models: generating natural language descriptions for schema elements, producing semantically meaningful names for repeated definitions, and classifying which discovered properties are genuinely useful. The models are obtained by fine-tuning Code Llama with LoRA on 657 JSON Schemas from the JSON Schema Store. The method performs well on standard text-generation metrics (e.g., BERTScore, ROUGE-L, BLEU) and on semantic naming (VarCLR), and it reaches about 90.5% accuracy on property selection. The results suggest that semantic augmentation can meaningfully improve the usability and interoperability of automatically discovered schemas, and that the approach can be integrated with existing schema discovery tools. Future work may broaden context usage and allow users to adjust the level of detail in the produced schemas.
Abstract
Semi-structured data formats such as JSON have proved to be useful data models for applications that require flexibility in the format of data stored. However, JSON data often come without the schemas that are typically available with relational data. This has resulted in a number of tools for discovering schemas from a collection of data. Although such tools can be useful, existing approaches focus on the syntax of documents and ignore semantic information. In this work, we explore the automatic addition of meaningful semantic information to discovered schemas, similar to the information added by human schema authors. We leverage large language models and a corpus of manually authored JSON Schema documents to generate natural language descriptions of schema elements, to produce meaningful names for reusable definitions, and to identify which discovered properties are most useful and which can be considered "noise". Our approach performs well on existing metrics for text generation that have previously been shown to correlate well with human judgement.
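To make the gap between syntactic and semantic schemas concrete, the following is a minimal sketch, not the paper's implementation. It assumes a toy discovery routine (`discover_schema`) that records only property names, types, and required keys, and a hypothetical `augment` helper that attaches descriptions. In the setting described above those descriptions would be generated by a language model; here they are supplied by hand, and all function names and sample documents are illustrative assumptions.

```python
import copy


def infer_type(value):
    """Map a Python value to a JSON Schema type name."""
    # bool must be checked before int, since bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        return "array"
    return "object"


def discover_schema(documents):
    """Toy, purely syntactic discovery over a list of flat JSON objects."""
    seen = {}
    for doc in documents:
        for key, value in doc.items():
            seen.setdefault(key, set()).add(infer_type(value))
    properties = {}
    for key in sorted(seen):
        types = sorted(seen[key])
        # A single observed type is written as a string, several as a list.
        properties[key] = {"type": types[0] if len(types) == 1 else types}
    # A property is required only if every sample document contains it.
    required = sorted(k for k in seen if all(k in d for d in documents))
    return {"type": "object", "properties": properties, "required": required}


def augment(schema, descriptions):
    """Return a copy of the schema with descriptions attached (hypothetical helper)."""
    out = copy.deepcopy(schema)
    for name, text in descriptions.items():
        if name in out["properties"]:
            out["properties"][name]["description"] = text
    return out


# Illustrative sample documents (assumed, not taken from the paper).
docs = [
    {"id": 1, "email": "a@example.com", "_tmp": "x"},
    {"id": 2, "email": "b@example.com"},
]
schema = discover_schema(docs)
augmented = augment(schema, {"email": "Contact email address of the user."})
```

The discovered `schema` is structurally valid but says nothing about what `email` means or whether `_tmp` is worth keeping; the `description` keyword added by `augment` is where generated text would be attached, and a property-selection step would decide whether a property such as `_tmp` is noise.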
