SimClone: Detecting Tabular Data Clones using Value Similarity

Xu Yang, Gopi Krishnan Rajbahadur, Dayi Lin, Shaowei Wang, Zhen Ming (Jack) Jiang

TL;DR

SimClone tackles data clone detection in tabular datasets by shifting from formatting-based cues to value-based similarity. It builds 14 value-focused similarity features across string and numeric data, trains a binary classifier on a large synthetic clone-injected corpus, and uses SHAP to guide visualization of clone locations. Across synthetic and real-world datasets, SimClone outperforms the spreadsheet-focused LTC baseline in F1 and AUC, and its visualization significantly improves clone localization. The approach enhances data provenance, license compliance, and leakage prevention in AI data pipelines, with a practical, open-source replication package and a lighter variant for faster deployment.
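To make the idea of value-based similarity concrete, the following is a minimal sketch that compares two tables purely by their cell values, with no headers or formatting. It is not the paper's feature set: the 14 features are not listed here, so the Jaccard overlap of column value sets and the `max`/`mean` summaries are assumptions chosen for illustration, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Set

Table = List[Sequence[object]]  # a table as a list of rows, no header


def column_values(table: Table, j: int) -> Set[str]:
    """Collect the distinct values of column j, compared as strings."""
    return {str(row[j]) for row in table}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two value sets (assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def column_similarity_matrix(left: Table, right: Table) -> List[List[float]]:
    """Entry [i][j] is the similarity of left column i and right column j."""
    n_left = len(left[0]) if left else 0
    n_right = len(right[0]) if right else 0
    left_cols = [column_values(left, i) for i in range(n_left)]
    right_cols = [column_values(right, j) for j in range(n_right)]
    return [[jaccard(a, b) for b in right_cols] for a in left_cols]


def summary_features(matrix: List[List[float]]) -> Dict[str, float]:
    """Reduce the matrix to fixed-size features a classifier could consume."""
    flat = [v for row in matrix for v in row]
    if not flat:
        return {"max": 0.0, "mean": 0.0}
    return {"max": max(flat), "mean": sum(flat) / len(flat)}


# Toy dataset pair: the right table reuses two rows of the left table,
# with columns reordered and an unrelated column added.
left = [["alice", 30], ["bob", 25], ["carol", 41]]
right = [[30, "alice", "x"], [25, "bob", "y"], [52, "dave", "z"]]

matrix = column_similarity_matrix(left, right)
features = summary_features(matrix)
print(matrix)
print(features)
```

Because the comparison looks only at values, the reordered columns in the toy pair are still matched. In a pipeline of the kind described above, summaries like these would be the inputs to the binary classifier.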

Abstract

Data clones are defined as multiple copies of the same data among datasets. The presence of data clones between datasets can cause issues such as difficulties in managing data assets and data license violations when datasets containing clones are used to build AI software. However, detecting data clones is not trivial. The majority of prior studies in this area rely on structural information (e.g., font size, column header) to detect data clones, yet tabular datasets used to build AI software are typically stored without any structural information. In this paper, we propose a novel method called SimClone for data clone detection in tabular datasets without relying on structural information. SimClone uses value similarities for data clone detection. We also propose a visualization approach as part of SimClone to help locate the exact position of the cloned data between a dataset pair. Our results show that SimClone outperforms the current state-of-the-art method by at least 20% in terms of both F1-score and AUC. In addition, SimClone's visualization component helps identify the exact location of the data clone in a dataset with a Precision@10 value of 0.80 in the top 20 true positive predictions.
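For readers unfamiliar with the Precision@10 metric cited above, here is a minimal sketch of the standard Precision@k calculation. The location identifiers and the ground-truth set are toy values invented for illustration, not the paper's data, and how the paper aggregates the metric over its 20 predictions is not specified here.

```python
from typing import Hashable, Sequence, Set


def precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the top-k ranked items that are truly relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Toy example: candidate clone locations ranked by a (hypothetical) highlight score.
ranked_locations = list(range(10))             # locations 0..9, best first
true_clone_locations = {0, 1, 2, 3, 4, 5, 6, 7}  # assumed ground truth

p_at_10 = precision_at_k(ranked_locations, true_clone_locations, k=10)
print(p_at_10)
```

In this toy ranking, eight of the ten highest-ranked locations are true clones, so the function returns 0.8.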

Paper Structure

This paper contains 38 sections, 5 equations, 9 figures, 9 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Overview of SimClone.
  • Figure 2: Type-1 injection operation.
  • Figure 3: Type-3 injection operation.
  • Figure 4: Calculation of row-to-row & column-to-column similarity matrices.
  • Figure 5: Workflow of similarity computation and feature generation for a dataset-pair.
  • ...and 4 more figures