SAVeD: Semantic Aware Version Discovery

Artem Frenk, Roee Shraga

TL;DR

SAVeD introduces a fully unsupervised, semantically aware approach to discovering dataset versions by framing version relationships as semantics-preserving transformations and learning them through a SimCLR-style contrastive objective. The framework employs eight table-specific augmentations, a custom table-oriented tokenization, and a transformer encoder to map tables into a shared embedding space, optimized via the NT-Xent loss with temperature tuning. On the Semantic Versioning in Databases Benchmark (SDVB), SAVeD achieves superior true positive rates on most datasets and notably improved intra- versus inter-dataset separation, outperforming untrained baselines and competing with supervised methods like Starmie. The work significantly reduces manual annotation needs and enables scalable, semantics-driven discovery of dataset versions in data lakes and integrated analytics pipelines, with broad implications for reuse and provenance in data science workflows.
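As a rough illustration of the NT-Xent objective with a temperature parameter mentioned above, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `nt_xent_loss`, the default temperature of 0.5, and the assumption that `z_a[i]` and `z_b[i]` are embeddings of two augmented views of the same table are all illustrative choices.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss for N positive pairs (z_a[i], z_b[i]).

    Hypothetical helper: z_a and z_b are (N, d) arrays holding the
    embeddings of two augmented views of the same N tables.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)             # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit length, so dot = cosine
    sim = (z @ z.T) / temperature                      # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # a view is never its own negative

    # Row i's positive is the other view of the same table.
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-sum-exp over all other views in the batch.
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))

    log_prob_pos = sim[np.arange(2 * n), pos_idx] - log_denom
    return float(-log_prob_pos.mean())


# Toy usage: near-identical views should score a lower loss than mismatched ones.
rng = np.random.default_rng(0)
views_a = rng.normal(size=(8, 16))
views_b = views_a + 0.01 * rng.normal(size=(8, 16))
loss_aligned = nt_xent_loss(views_a, views_b)
loss_mismatched = nt_xent_loss(views_a, views_b[::-1])
```

A lower temperature sharpens the softmax and penalizes hard negatives more strongly, which is why it is typically treated as a tunable hyperparameter.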

Abstract

Our work introduces SAVeD (Semantically Aware Version Detection), a contrastive learning-based framework for identifying versions of structured datasets without relying on metadata, labels, or integration-based assumptions. SAVeD addresses a common challenge in data science: repeated labor caused by the difficulty of finding similar prior work or transformations on datasets. SAVeD employs a modified SimCLR pipeline, generating augmented table views through random transformations (e.g., row deletion, encoding perturbations). These views are embedded via a custom transformer encoder and contrasted in latent space to optimize semantic similarity. Our model learns to minimize distances between augmented views of the same dataset and maximize those between unrelated tables. We evaluate performance using two metrics: validation accuracy, the proportion of correctly classified version/non-version pairs on a hold-out set, and separation, the difference between the average similarities of versioned and non-versioned tables (the version labels are defined by a benchmark and are not provided to the model). Our experiments span five canonical datasets from the Semantic Versioning in Databases Benchmark and demonstrate substantial gains post-training. SAVeD achieves significantly higher accuracy on completely unseen tables and a significant boost in separation scores, confirming its capability to distinguish semantically altered versions. Compared to untrained baselines and prior state-of-the-art dataset-discovery methods like Starmie, our custom encoder achieves competitive or superior results.
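The two evaluation metrics described in the abstract can be sketched as follows. This is an assumption-laden illustration rather than the paper's evaluation code: the function names, the input layout (one similarity score and one benchmark label per table pair), and the decision threshold of 0.5 used for accuracy are all hypothetical.

```python
import numpy as np

def separation(similarity, is_version):
    """Mean similarity of version pairs minus mean similarity of non-version pairs."""
    sim = np.asarray(similarity, dtype=float)
    mask = np.asarray(is_version, dtype=bool)
    return float(sim[mask].mean() - sim[~mask].mean())

def pair_accuracy(similarity, is_version, threshold=0.5):
    """Fraction of pairs whose thresholded similarity matches the benchmark label.

    The threshold is an assumed free parameter; the source does not state
    how pairs are classified.
    """
    predicted = np.asarray(similarity, dtype=float) >= threshold
    return float((predicted == np.asarray(is_version, dtype=bool)).mean())


# Toy usage with made-up scores: two version pairs, two non-version pairs.
scores = [0.9, 0.8, 0.2, 0.4]
labels = [True, True, False, False]
sep = separation(scores, labels)
acc = pair_accuracy(scores, labels)
```

A larger separation means version pairs sit further from non-version pairs in similarity, which makes any fixed threshold less sensitive to its exact value.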

Paper Structure

This paper contains 28 sections, 10 equations, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Table View Creation Pipeline