Towards the Automated Extraction and Refactoring of NoSQL Schemas from Application Code
Carlos J. Fernandez-Candel, Anthony Cleve, Jesus J. Garcia-Molina
TL;DR
The paper tackles the challenge of extracting NoSQL schemas from schemaless applications by introducing a static, model-driven pipeline. It injects source code into a language-agnostic Code model, derives a Control Flow model, and builds a DOS model that captures CRUD operations and data structure, which is then transformed into a U-Schema logical schema. A key contribution is the integration of join-removal refactoring, where the DOS insights guide field duplication to eliminate expensive joins, with automated updates to the schema, data, and code. The approach is validated through a round-trip experiment using a MongoDB-backed Node.js app, showing accurate schema recovery and actionable refactoring plans while highlighting limitations around asynchronous patterns and dynamic typing. The work blends metamodel-driven engineering with static analysis to enable schema-aware NoSQL tooling and sets the stage for future cross-language extensions and hybrid analysis strategies.
Abstract
In this paper, we present a static code analysis strategy to extract logical schemas from NoSQL applications. Our solution is based on a model-driven reverse engineering process composed of a chain of platform-independent model transformations. The extracted schema conforms to the U-Schema unified metamodel, which can represent both NoSQL and relational schemas. To support this process, we define a metamodel capable of representing the core elements of object-oriented languages. Application code is first injected into a code model, from which a control flow model is derived. This, in turn, enables the generation of a model representing both data access operations and the structure of stored data. From these models, the U-Schema logical schema is inferred. Additionally, the extracted information can be used to identify refactoring opportunities. We illustrate this capability through the detection of join-like query patterns and the automated application of field duplication strategies to eliminate expensive joins. All stages of the process are described in detail, and the approach is validated through a round-trip experiment in which an application using a MongoDB store is automatically generated from a predefined schema. The inferred schema is then compared to the original to assess the accuracy of the extraction process.
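To make the join-removal idea concrete, the following is a minimal sketch of field duplication. It is not the paper's tooling: the collections are plain Python structures standing in for MongoDB collections, and every name in it (`users`, `orders`, `user_name`, `duplicate_field`) is a hypothetical chosen for illustration. It shows a join-like access that needs a second lookup, a data migration that copies the referenced field into the referencing documents, and the rewritten access that no longer needs the lookup.

```python
from copy import deepcopy

# Hypothetical in-memory stand-ins for two MongoDB collections.
users = {1: {"_id": 1, "name": "Ada"}, 2: {"_id": 2, "name": "Linus"}}
orders = [
    {"_id": 10, "user_id": 1, "total": 30},
    {"_id": 11, "user_id": 2, "total": 12},
]


def order_summary_with_join(order, users):
    """Join-like access: a second lookup resolves the customer name."""
    user = users[order["user_id"]]
    return {"order": order["_id"], "customer": user["name"]}


def duplicate_field(orders, users, ref_field, source_field, target_field):
    """Data migration: copy source_field from the referenced entity
    into each referencing document, leaving the input untouched."""
    migrated = deepcopy(orders)
    for doc in migrated:
        doc[target_field] = users[doc[ref_field]][source_field]
    return migrated


def order_summary_without_join(order):
    """Refactored access: the duplicated field makes the lookup unnecessary."""
    return {"order": order["_id"], "customer": order["user_name"]}


migrated_orders = duplicate_field(orders, users, "user_id", "name", "user_name")
before = [order_summary_with_join(o, users) for o in orders]
after = [order_summary_without_join(o) for o in migrated_orders]
```

In this sketch `before` and `after` hold the same summaries, which is the property a join-removal refactoring has to preserve. The trade-off is that any later change to a user's name must also be propagated to the duplicated copies.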
