Bridging Textual Data and Conceptual Models: A Model-Agnostic Structuring Approach
Jacques Chabin, Mirian Halfeld Ferrari, Nicolas Hiot
TL;DR
The paper tackles the challenge of turning unstructured text into structured representations compatible with multiple database models. It introduces ArchiTXT, a model-agnostic pipeline in which a meta-grammar G, defined as an attribute grammar, guides the iterative transformation of syntax trees into a target grammar G_T that serves as a coherent, model-agnostic schema, along with an instance I_T. The key contributions are the formalization of G and G_T, the iterative tree rewriting and grammar extraction framework, and a proof-of-concept demonstration on clinical CAS data showing grammar and schema consolidation. The resulting structuring is transparent and interoperable: its output can feed various database models and downstream analytics, and the paper offers a reproducible methodology and an implementation plan for future enhancements.
Abstract
We introduce an automated method for structuring textual data into a model-agnostic schema, enabling alignment with any database model. It generates both a schema and its instance. Initially, textual data is represented as semantically enriched syntax trees, which are then refined through iterative tree rewriting and grammar extraction, guided by the attribute grammar meta-model G. The applicability of this approach is demonstrated using clinical cases as a proof of concept.
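To make the idea of alternating tree rewriting with grammar extraction concrete, here is a minimal Python sketch. It is not ArchiTXT's implementation: the `Node` class, the `extract_rules` and `collapse_unary` helpers, and the labels `ENT_PATIENT` and `ENT_DRUG` are all hypothetical, and collapsing single-child nodes is only a stand-in for the paper's rewriting operations. The sketch shows how one rewrite pass can shrink the set of productions read off a tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Node:
    """A labelled tree node; leaves carry a token in `text` (hypothetical type)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    text: Optional[str] = None


Rule = Tuple[str, Tuple[str, ...]]


def extract_rules(tree: Node) -> Set[Rule]:
    """Collect one production (parent label -> child labels) per inner node."""
    rules: Set[Rule] = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if node.children:
            rules.add((node.label, tuple(c.label for c in node.children)))
            stack.extend(node.children)
    return rules


def collapse_unary(tree: Node, keep: Set[str]) -> Node:
    """Toy rewrite (an assumption, not the paper's operation): replace an
    inner node that has exactly one child by that child, unless its label
    is listed in `keep`."""
    children = [collapse_unary(c, keep) for c in tree.children]
    if len(children) == 1 and tree.label not in keep:
        return children[0]
    return Node(tree.label, children, tree.text)


# A small enriched syntax tree with made-up entity labels.
tree = Node("SENT", [
    Node("NP", [Node("ENT_PATIENT", text="patient")]),
    Node("VP", [
        Node("V", text="received"),
        Node("NP", [Node("ENT_DRUG", text="aspirin")]),
    ]),
])

before = extract_rules(tree)
after = extract_rules(collapse_unary(tree, keep={"SENT"}))
```

In this toy run, `before` holds four productions and `after` holds two, because the two `NP` wrappers disappear. A real pipeline would repeat such passes until the extracted grammar stops changing and would check each candidate grammar against the meta-grammar G, which this sketch does not attempt.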
