AI-assisted JSON Schema Creation and Mapping

Felix Neubauer, Jürgen Pleiss, Benjamin Uekermann

TL;DR

The paper tackles the lack of standardized domain schemas in scientific data by proposing a hybrid workflow that pairs large language models with deterministic safeguards to enable natural-language-driven JSON Schema creation and data integration mappings. Implemented in the open-source MetaConfigurator, the approach provides a chat-like interface for schema creation/modification, integrated validation/visualization, and a deterministic JSONata-based mapping pipeline for heterogeneous data formats. A chemistry application demonstrates end-to-end transformation of unstructured data into interoperable, AI-ready representations and showcases when bespoke code is needed for complex integrations like XDL. The work lowers barriers to structured data modeling and FAIR data practices, enabling non-experts to participate in schema-driven modeling and data integration at scale, with potential for future model-to-model transformations.

Abstract

Model-Driven Engineering (MDE) places models at the core of system and data engineering processes. In the context of research data, these models are typically expressed as schemas that define the structure and semantics of datasets. However, many domains still lack standardized models, and creating them remains a significant barrier, especially for non-experts. We present a hybrid approach that combines large language models (LLMs) with deterministic techniques to enable JSON Schema creation, modification, and schema mapping based on natural language inputs by the user. These capabilities are integrated into the open-source tool MetaConfigurator, which already provides visual model editing, validation, code generation, and form generation from models. For data integration, we generate schema mappings from heterogeneous JSON, CSV, XML, and YAML data using LLMs, while ensuring scalability and reliability through deterministic execution of generated mapping rules. The applicability of our work is demonstrated in an application example in the field of chemistry. By combining natural language interaction with deterministic safeguards, this work significantly lowers the barrier to structured data modeling and data integration for non-experts.

Paper Structure

This paper contains 11 sections and 3 figures.

Figures (3)

  • Figure 1: Example of natural language schema creation: The user provides a prompt (top), which is translated into a structured visual schema (bottom).
  • Figure 2: Example of schema modification through natural language: The user requests schema changes using a prompt (top), which results in the updated schema (bottom).
  • Figure 3: Example of schema mapping: given an input JSON document and a target schema (top), the LLM generates a transformation expression (middle), which is then used to derive the structured output (bottom).
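
The mapping workflow described in the abstract and in Figure 3 can be illustrated with a small sketch: a mapping rule is produced once, then applied deterministically to every record, and each result is checked against the target schema. This is not MetaConfigurator's implementation. The paper uses JSONata expressions generated by an LLM; here a hand-written dictionary of paths stands in for such an expression, and the validator covers only `required` and two `type` values. All field names and records (`MAPPING`, `TARGET_SCHEMA`, `mass_g_per_mol`, and so on) are hypothetical.

```python
# Hypothetical mapping rule: target field -> path into the source record.
# In the paper's workflow an LLM would propose a JSONata expression instead;
# this dict is a simplified stand-in that needs only the standard library.
MAPPING = {
    "name": ["compound", "label"],
    "molarMass": ["props", "mass_g_per_mol"],
}

# Hypothetical target schema, using only a small subset of JSON Schema keywords.
TARGET_SCHEMA = {
    "type": "object",
    "required": ["name", "molarMass"],
    "properties": {
        "name": {"type": "string"},
        "molarMass": {"type": "number"},
    },
}


def get_path(record, path):
    """Follow a list of keys into nested dicts; return None if any key is missing."""
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def apply_mapping(record, mapping):
    """Apply the mapping rule to one record. No model call happens here."""
    output = {}
    for target_field, path in mapping.items():
        value = get_path(record, path)
        if value is not None:
            output[target_field] = value
    return output


def validate(document, schema):
    """Check `required` and basic `type` constraints; return a list of error strings."""
    errors = []
    for field in schema.get("required", []):
        if field not in document:
            errors.append(f"missing required field: {field}")
    type_checks = {"string": str, "number": (int, float)}
    for field, rule in schema.get("properties", {}).items():
        if field in document:
            value = document[field]
            expected = type_checks[rule["type"]]
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                errors.append(f"wrong type for field: {field}")
    return errors


# Hypothetical input records; the second one lacks the mass value.
records = [
    {"compound": {"label": "ethanol"}, "props": {"mass_g_per_mol": 46.07}},
    {"compound": {"label": "water"}, "props": {}},
]

results = [apply_mapping(record, MAPPING) for record in records]
reports = [validate(document, TARGET_SCHEMA) for document in results]
```

Because the rule is executed as plain code, the same input always yields the same output, and the cost of the model is paid once per mapping rather than once per record. The validation step is what surfaces a faulty or incomplete rule, as with the second record above.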