AI-assisted JSON Schema Creation and Mapping
Felix Neubauer, Jürgen Pleiss, Benjamin Uekermann
TL;DR
The paper addresses the lack of standardized domain schemas for scientific data. It proposes a hybrid workflow that pairs large language models with deterministic safeguards, so that users can create JSON Schemas and data integration mappings from natural-language input. Implemented in the open-source MetaConfigurator, the approach provides a chat-like interface for creating and modifying schemas, integrated validation and visualization, and a deterministic JSONata-based mapping pipeline for heterogeneous data formats. A chemistry application demonstrates the end-to-end transformation of unstructured data into interoperable, AI-ready representations and shows where bespoke code is still needed for complex integrations such as XDL. The work lowers the barrier to structured data modeling and FAIR data practices, allowing non-experts to take part in schema-driven modeling and data integration at scale, with potential for future model-to-model transformations.
Abstract
Model-Driven Engineering (MDE) places models at the core of system and data engineering processes. In the context of research data, these models are typically expressed as schemas that define the structure and semantics of datasets. However, many domains still lack standardized models, and creating them remains a significant barrier, especially for non-experts. We present a hybrid approach that combines large language models (LLMs) with deterministic techniques to enable JSON Schema creation, modification, and schema mapping based on natural language inputs by the user. These capabilities are integrated into the open-source tool MetaConfigurator, which already provides visual model editing, validation, code generation, and form generation from models. For data integration, we generate schema mappings from heterogeneous JSON, CSV, XML, and YAML data using LLMs, while ensuring scalability and reliability through deterministic execution of generated mapping rules. The applicability of our work is demonstrated in an application example in the field of chemistry. By combining natural language interaction with deterministic safeguards, this work significantly lowers the barrier to structured data modeling and data integration for non-experts.
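The separation between LLM-generated mapping rules and their deterministic execution can be illustrated with a minimal sketch. It is not MetaConfigurator's implementation: the rule format, the helper names (`apply_mapping`, `validate`, `transform_csv`), the tiny schema, and the sample CSV are all hypothetical, and a plain dictionary of rules stands in for the JSONata expressions mentioned in the TL;DR. The point is only that once the rules exist as data, every record is transformed and checked by ordinary code, without calling a model again.

```python
import csv
import io
import json

# Hypothetical mapping rules, standing in for what an LLM might propose once.
# They are plain data, so they can be reviewed, stored, and re-run.
MAPPING_RULES = {
    "name": {"source": "compound", "as": "string"},
    "molarMass": {"source": "mass_g_per_mol", "as": "number"},
}

# Hypothetical target schema, using a very small subset of JSON Schema.
TARGET_SCHEMA = {
    "type": "object",
    "required": ["name", "molarMass"],
    "properties": {
        "name": {"type": "string"},
        "molarMass": {"type": "number"},
    },
}

_CONVERTERS = {"string": str, "number": float}
_PY_TYPES = {"string": str, "number": (int, float)}


def apply_mapping(record, rules):
    """Deterministically build a target object from one source record."""
    return {
        target: _CONVERTERS[rule["as"]](record[rule["source"]])
        for target, rule in rules.items()
    }


def validate(instance, schema):
    """Check required properties and primitive types; return error messages."""
    errors = []
    for key in schema.get("required", []):
        if key not in instance:
            errors.append(f"missing required property: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key in instance and not isinstance(instance[key], _PY_TYPES[spec["type"]]):
            errors.append(f"property {key} is not of type {spec['type']}")
    return errors


def transform_csv(text, rules, schema):
    """Map every CSV row with the same rules and reject invalid results."""
    results = []
    for row in csv.DictReader(io.StringIO(text)):
        mapped = apply_mapping(row, rules)
        errors = validate(mapped, schema)
        if errors:
            raise ValueError(f"row {row} failed validation: {errors}")
        results.append(mapped)
    return results


# Illustrative input only; not taken from the paper's application example.
CSV_INPUT = "compound,mass_g_per_mol\nwater,18.015\nethanol,46.069\n"

if __name__ == "__main__":
    print(json.dumps(transform_csv(CSV_INPUT, MAPPING_RULES, TARGET_SCHEMA), indent=2))
```

Because the rules are fixed before execution, the cost of the transformation grows with the number of records rather than with the number of model calls, and a failed validation points to a specific row and rule instead of an opaque model output.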
