Bridging Textual Data and Conceptual Models: A Model-Agnostic Structuring Approach

Jacques Chabin, Mirian Halfeld Ferrari, Nicolas Hiot

TL;DR

The paper addresses the problem of turning unstructured text into structured representations compatible with multiple database models. It introduces ArchiTXT, a model-agnostic pipeline that relies on a meta-grammar G (an attribute grammar) and a target grammar G_T: syntax trees are transformed iteratively until they yield a coherent, model-agnostic schema G_T together with an instance I_T. The key contributions are the formalization of G and G_T, the iterative tree rewriting and grammar extraction framework, and a proof-of-concept demonstration on clinical CAS data showing grammar and schema consolidation. The resulting structure is transparent and interoperable, so it can feed various database models and downstream analytics. The paper also provides a reproducible methodology and a plan for future enhancements.

Abstract

We introduce an automated method for structuring textual data into a model-agnostic schema, enabling alignment with any database model. It generates both a schema and its instance. Initially, textual data is represented as semantically enriched syntax trees, which are then refined through iterative tree rewriting and grammar extraction, guided by the attribute grammar meta-model G. The applicability of this approach is demonstrated using clinical medical cases as a proof of concept.
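
To make the grammar extraction step more concrete, here is a minimal sketch of reading production rules off a labeled tree. It is not the ArchiTXT implementation: the `Node` class, the `extract_productions` function, and the toy labels (`SENT`, `PATIENT`, `EXAM`, and so on) are all hypothetical names invented for illustration, and the sketch leaves out the tree rewriting and the attribute-grammar constraints described in the paper.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A labeled tree node (hypothetical structure); leaves carry a token in `text`."""
    label: str
    children: List["Node"] = field(default_factory=list)
    text: Optional[str] = None


def extract_productions(root: Node) -> Counter:
    """Count productions `label -> child labels` over all internal nodes."""
    productions: Counter = Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:  # leaves contribute no production
            rhs: Tuple[str, ...] = tuple(child.label for child in node.children)
            productions[(node.label, rhs)] += 1
            stack.extend(node.children)
    return productions


# Toy tree, invented for illustration only (not taken from the clinical corpus).
tree = Node("SENT", [
    Node("PATIENT", [Node("AGE", text="45"), Node("SEX", text="female")]),
    Node("EXAM", [Node("NAME", text="MRI")]),
    Node("PATIENT", [Node("AGE", text="60"), Node("SEX", text="male")]),
])

grammar = extract_productions(tree)
for (lhs, rhs), count in sorted(grammar.items()):
    print(f"{lhs} -> {' '.join(rhs)}  (x{count})")
```

In this toy tree the production `PATIENT -> AGE SEX` occurs twice. A structuring pipeline could use such repeated productions as a signal when consolidating a schema, though the scoring used here (raw counts) is an assumption of this sketch.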

Paper Structure

This paper contains 16 sections, 7 equations, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: Architecture of ArchiTXT
  • Figure 2: Example of derivation
  • Figure 3: Named-entity incorporation in a syntax tree and simplifications (example)
  • Figure 4: Iterative process for automatic structuring
  • Figure 5: Example of quotient tree
  • ...and 5 more figures

Theorems & Definitions (9)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5: Equivalence classes
  • Remark 1
  • Definition 6: Support and Confidence Score
  • Definition 7: Dependency Score
  • Definition 8: Redundancy Score