Towards the Automated Extraction and Refactoring of NoSQL Schemas from Application Code

Carlos J. Fernandez-Candel, Anthony Cleve, Jesus J. Garcia-Molina

TL;DR

The paper tackles the challenge of extracting NoSQL schemas from schemaless applications by introducing a static, model-driven pipeline. It injects source code into a language-agnostic Code model, derives a Control Flow model, and builds a DOS model that captures CRUD operations and data structure, which is then transformed into a U-Schema logical schema. A key contribution is the integration of join-removal refactoring, where the DOS insights guide field duplication to eliminate expensive joins, with automated updates to the schema, data, and code. The approach is validated through a round-trip experiment using a MongoDB-backed Node.js app, showing accurate schema recovery and actionable refactoring plans while highlighting limitations around asynchronous patterns and dynamic typing. The work blends metamodel-driven engineering with static analysis to enable schema-aware NoSQL tooling and sets the stage for future cross-language extensions and hybrid analysis strategies.

Abstract

In this paper, we present a static code analysis strategy to extract logical schemas from NoSQL applications. Our solution is based on a model-driven reverse engineering process composed of a chain of platform-independent model transformations. The extracted schema conforms to the U-Schema unified metamodel, which can represent both NoSQL and relational schemas. To support this process, we define a metamodel capable of representing the core elements of object-oriented languages. Application code is first injected into a code model, from which a control flow model is derived. This, in turn, enables the generation of a model representing both data access operations and the structure of stored data. From these models, the U-Schema logical schema is inferred. Additionally, the extracted information can be used to identify refactoring opportunities. We illustrate this capability through the detection of join-like query patterns and the automated application of field duplication strategies to eliminate expensive joins. All stages of the process are described in detail, and the approach is validated through a round-trip experiment in which an application using a MongoDB store is automatically generated from a predefined schema. The inferred schema is then compared to the original to assess the accuracy of the extraction process.
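
To make the join-removal idea concrete, the following is a minimal sketch of field duplication on in-memory documents. It is not the paper's implementation: the collections, the field names (`watched`, `movie_id`, `title`), and all function names are assumptions chosen for illustration, loosely inspired by the Users and Movies objects of the "streaming service" example, and plain Python dictionaries stand in for a MongoDB store.

```python
from copy import deepcopy

# Hypothetical documents; every field name here is an assumption.
movies = [
    {"_id": "m1", "title": "Alien", "year": 1979},
    {"_id": "m2", "title": "Heat", "year": 1995},
]
users = [
    {"_id": "u1", "name": "Ana",
     "watched": [{"movie_id": "m1"}, {"movie_id": "m2"}]},
]


def watched_titles_with_join(user, movies):
    """Join-like pattern: resolve each reference against the movies collection."""
    by_id = {m["_id"]: m for m in movies}
    return [by_id[ref["movie_id"]]["title"] for ref in user["watched"]]


def duplicate_field(users, movies, field):
    """Return a copy of users where `field` is copied into each movie reference."""
    by_id = {m["_id"]: m for m in movies}
    result = deepcopy(users)  # leave the original documents untouched
    for user in result:
        for ref in user["watched"]:
            ref[field] = by_id[ref["movie_id"]][field]
    return result


def watched_titles_without_join(user):
    """After duplication, the same query reads only the user document."""
    return [ref["title"] for ref in user["watched"]]


refactored_users = duplicate_field(users, movies, "title")
```

In this sketch, `watched_titles_without_join(refactored_users[0])` returns the same titles as the join-like version while touching a single document. The trade-off is that the duplicated field must be kept in sync when a movie title changes, which is why such a refactoring involves the schema, the stored data, and the application code together.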

Paper Structure

This paper contains 25 sections, 14 figures, 4 tables, 1 algorithm.

Figures (14)

  • Figure 1: Users and Movies objects in the "streaming service" document store.
  • Figure 2: Entities and Queries extracted for the FWM example.
  • Figure 3: Overview of the U-Schema extraction and join query removal approach.
  • Figure 4: Excerpt of the main elements of the Code Metamodel.
  • Figure 5: Excerpt of the Code model extracted for the running example (starting from line 9).
  • ...and 9 more figures