Comprehending Semantic Types in JSON Data with Graph Neural Networks
Shuang Wei, Michael J. Mior
TL;DR
This work extends semantic-type prediction from relational columns to nested JSON data by labeling values according to JSON Paths and treating each key-value pair as a data point. It introduces a graph-based model that converts JSON documents into graphs and processes them with a two-layer GCN, trained end-to-end to predict semantic types using features adapted from Sherlock. Experiments on Twitter and Meetup datasets show the proposed model often matches or exceeds Sherlock, particularly for multi-node structures, while yielding a smaller model and higher accuracy in several cases. The approach demonstrates the potential of graph-based representations to capture structural information in semi-structured data, with future work aimed at richer subgraph representations and broader dataset validation for improved generalization.
Abstract
Semantic types describe data in more detail than atomic types such as strings or integers. They connect columns to real-world concepts, providing fine-grained information that is useful for tasks such as automated data cleaning, schema matching, and data discovery. Existing deep learning models trained on large text corpora have been successful at single-column semantic type prediction for relational data. In this work, we extend the semantic type prediction problem to JSON data, labeling types based on JSON Paths. JSON Path is a query language that navigates complex JSON data structures by specifying the location of elements, so a path plays a role similar to that of a column in relational data. We use a graph neural network to capture the structural information within collections of JSON documents. Our model outperforms an existing state-of-the-art model in several cases. These results demonstrate the ability of our model to understand complex JSON data and its potential for JSON-related data processing tasks.
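To make the analogy between JSON Paths and relational columns concrete, the following minimal sketch groups the primitive values of a small document collection by path. It is an illustration only, not the authors' pipeline: the function names `iter_path_values` and `group_by_path`, the sample documents, and the choice to collapse array indices into a `[*]` wildcard are all assumptions made here, and keys containing dots or other special characters would need bracket notation that this sketch omits.

```python
import json
from collections import defaultdict


def iter_path_values(node, path="$"):
    """Yield (json_path, value) for every primitive value in a parsed document."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_path_values(child, f"{path}.{key}")
    elif isinstance(node, list):
        # Assumption: all elements of an array share one wildcard path.
        for child in node:
            yield from iter_path_values(child, f"{path}[*]")
    else:
        yield path, node


def group_by_path(documents):
    """Collect values across documents under their shared path (a column-like view)."""
    groups = defaultdict(list)
    for doc in documents:
        for path, value in iter_path_values(doc):
            groups[path].append(value)
    return dict(groups)


# Hypothetical sample documents, not taken from the Twitter or Meetup datasets.
docs = [json.loads(s) for s in (
    '{"user": {"name": "ada", "location": "London"}, "tags": ["ml", "json"]}',
    '{"user": {"name": "bob", "location": "Paris"}, "tags": []}',
)]
columns = group_by_path(docs)
```

Each key of `columns` (for example `$.user.location`) then holds the values that a semantic type label would describe, in the same way a column name groups the cells of a relational table.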
