GXJoin: Generalized Cell Transformations for Explainable Joinability
Soroush Omidvartehrani, Arash Dargahi Nobari, Davood Rafiei
TL;DR
This work tackles the challenge of joinability in tabular data under syntactic transformations by studying generalized cell transformations that map source formatting to target formatting to enable equi-joins. It introduces four generalization strategies (length invariance via relative indices, recurrence through unit repetition and removal, bidirectional source-target detection, and simplicity-based tie-breaking), along with a cluster-sampling method to curb the expanded search space. Empirical results on two real-world datasets show that these generalizations increase transformation coverage, reduce the number of required transformations, improve generalization to unseen data, and enhance end-to-end join performance, outperforming a state-of-the-art baseline. The findings offer a more explainable and scalable approach to data integration, with potential extensions to streaming data and integration with language-model-based methods.
Abstract
Descriptions of real-world entities can vary across sources, posing a challenge when integrating or exchanging data. We study the problem of joinability under syntactic transformations, where two columns are not equi-joinable but can become equi-joinable after some transformations. Discovering those transformations is challenging because the space of possible candidates is large and grows with the input length and the number of rows. Our focus is on the generality of transformations, aiming to make the relevant models applicable across various instances and domains. We explore a few generalization techniques, emphasizing those that yield transformations covering a larger number of rows and are often easier to explain. In an extensive evaluation on two real-world datasets, using diverse metrics for the coverage and simplicity of the transformations, our approach outperforms state-of-the-art approaches: it generates fewer and simpler, and hence more explainable, transformations and improves join performance.
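To make the notions of a cell transformation, its coverage, and length invariance via relative indices more concrete, the following is a minimal illustrative sketch and not the authors' implementation. The unit names (`substr`, `literal`), the helper functions, and the example rows are all hypothetical and chosen only for illustration. The sketch compares a transformation that copies characters at absolute positions with one that uses indices relative to the end of the cell, and reports the fraction of source-target row pairs each one covers.

```python
from typing import Callable, List, Tuple

# A unit maps a source cell to one piece of the target cell (hypothetical design).
Unit = Callable[[str], str]


def substr(start: int, end: int) -> Unit:
    """Copy source[start:end]; negative indices count from the end of the cell."""
    return lambda s: s[start:end]


def literal(text: str) -> Unit:
    """Emit a constant string regardless of the source cell."""
    return lambda _s: text


def apply_transformation(units: List[Unit], cell: str) -> str:
    """A transformation is the concatenation of its units' outputs."""
    return "".join(u(cell) for u in units)


def coverage(units: List[Unit], pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (source, target) pairs the transformation maps exactly."""
    hits = sum(apply_transformation(units, src) == tgt for src, tgt in pairs)
    return hits / len(pairs)


# Made-up rows: the year sits at a different absolute offset in each source cell.
pairs = [
    ("report_2021.csv", "FY2021"),
    ("summary_2020.csv", "FY2020"),
    ("q_2019.csv", "FY2019"),
]

# Absolute positions fit only the first row.
absolute = [literal("FY"), substr(7, 11)]
# Positions relative to the end of the cell fit every row.
relative = [literal("FY"), substr(-8, -4)]

print(coverage(absolute, pairs))
print(coverage(relative, pairs))
```

In this toy setting the relative-index transformation covers all three rows while the absolute one covers a single row, which is the kind of effect on coverage and on the number of required transformations that the generalizations target. Once the source column is rewritten by such a transformation, an ordinary equi-join against the target column becomes possible.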
