SimClone: Detecting Tabular Data Clones using Value Similarity

Xu Yang; Gopi Krishnan Rajbahadur; Dayi Lin; Shaowei Wang; Zhen Ming Jiang

TL;DR

SimClone tackles data clone detection in tabular datasets by shifting from formatting-based cues to value-based similarity. It builds 14 value-focused similarity features across string and numeric data, trains a binary classifier on a large synthetic clone-injected corpus, and uses SHAP to guide visualization of clone locations. Across synthetic and real-world datasets, SimClone outperforms the spreadsheet-focused LTC baseline in F1 and AUC, and its visualization significantly improves clone localization. The approach enhances data provenance, license compliance, and leakage prevention in AI data pipelines, with a practical, open-source replication package and a lighter variant for faster deployment.
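The value-similarity idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the two cell-level metrics (a `difflib` ratio for strings, a relative difference for numbers), the three summary features, and the names `value_similarity`, `row_similarity_matrix`, and `summary_features` are all assumptions chosen for illustration. The sketch also assumes both tables have the same number of columns in the same order.

```python
from difflib import SequenceMatcher
from statistics import mean


def value_similarity(a, b):
    """Similarity in [0, 1] between two cell values (illustrative metrics only)."""
    numeric = (int, float)
    if isinstance(a, numeric) and isinstance(b, numeric):
        denom = max(abs(a), abs(b))
        # Identical zeros are fully similar; otherwise use a clipped relative difference.
        return 1.0 if denom == 0 else max(0.0, 1.0 - abs(a - b) / denom)
    # Anything non-numeric is compared as text.
    return SequenceMatcher(None, str(a), str(b)).ratio()


def row_similarity_matrix(table_a, table_b):
    """Entry [i][j] is the mean cell similarity between row i of A and row j of B."""
    return [
        [mean(value_similarity(x, y) for x, y in zip(row_a, row_b)) for row_b in table_b]
        for row_a in table_a
    ]


def summary_features(matrix):
    """Collapse a similarity matrix into a few scalars a classifier could consume."""
    best = [max(row) for row in matrix]  # best match in B for each row of A
    return {"max": max(best), "mean_best": mean(best), "min_best": min(best)}


# Hypothetical toy tables: the row ["bob", 25] appears in both.
table_a = [["alice", 30], ["bob", 25]]
table_b = [["bob", 25], ["carol", 41]]
matrix = row_similarity_matrix(table_a, table_b)
features = summary_features(matrix)
```

In this toy pair, the shared row yields a similarity of 1.0 at `matrix[1][0]`, so the `max` feature is 1.0. A real detector would compute many such features over both row-to-row and column-to-column matrices before training a classifier.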

Abstract

Data clones are defined as multiple copies of the same data among datasets. The presence of data clones between datasets can cause issues such as difficulties in managing data assets and data license violations when datasets with clones are used to build AI software. However, detecting data clones is not trivial. The majority of prior studies in this area rely on structural information (e.g., font size, column header) to detect data clones. However, tabular datasets used to build AI software are typically stored without any structural information. In this paper, we propose a novel method called SimClone for data clone detection in tabular datasets without relying on structural information. SimClone uses value similarities for data clone detection. We also propose a visualization approach as part of SimClone to help locate the exact position of the cloned data between a dataset pair. Our results show that SimClone outperforms the current state-of-the-art method by at least 20% in terms of both F1-score and AUC. In addition, SimClone's visualization component helps identify the exact location of the data clone in a dataset with a Precision@10 value of 0.80 in the top 20 true positive predictions.
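For readers unfamiliar with the Precision@10 figure quoted above, the following sketch shows how Precision@k is conventionally computed. How SimClone ranks candidate locations is not described in this section, so the ranked list of `(row, column)` cells, the ground-truth set, and the function name `precision_at_k` are hypothetical.

```python
def precision_at_k(ranked_cells, true_clone_cells, k=10):
    """Fraction of the top-k ranked cells that are true clone locations."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_cells[:k]
    hits = sum(1 for cell in top_k if cell in true_clone_cells)
    # Standard definition: divide by k, even if fewer than k cells were ranked.
    return hits / k


# Hypothetical ranking of (row, column) cells, most suspicious first.
ranked = [(0, 0), (0, 1), (3, 2), (1, 1), (5, 5)]
truth = {(0, 0), (0, 1), (1, 1)}
score = precision_at_k(ranked, truth, k=5)
```

Here three of the top five ranked cells are true clone locations, so `score` is 0.6. Under this definition, a Precision@10 of 0.80 means that, on average, 8 of the 10 highest-ranked locations are actual clones.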

Paper Structure (38 sections, 5 equations, 9 figures, 9 tables, 1 algorithm)

Introduction
Definition of Data Clone
Related Work
Data quality issues
Data Clone Detection in Spreadsheets
Detecting Clones and Duplicates in other Software Artifacts
Methodology
Synthetic dataset creation
Similarity computation
Value-based similarity metrics
Similarity matrices calculation
Feature generation
Data clone detection classifier construction and inference
Data clone visualization
Experiment Design
...and 23 more sections

Figures (9)

Figure 1: Overview of SimClone.
Figure 2: Type-1 injection operation.
Figure 3: Type-3 injection operation.
Figure 4: Calculation of row-to-row & column-to-column similarity matrices.
Figure 5: Workflow of similarity computation and feature generation for a dataset-pair.
...and 4 more figures