Relationships are Complicated! An Analysis of Relationships Between Datasets on the Web

Kate Lin, Tarfah Alrashed, Natasha Noy

TL;DR

The paper tackles the challenge of understanding how datasets distributed across the Web relate to each other, arguing that such relationships are crucial for discovery, reproducibility, and trust. It grounds a comprehensive taxonomy in user tasks and formal definitions, distinguishing provenance-based relationships (Replica, Version, Subset, Derivation, Variant) from non-provenance-based ones (Topically Similar, Task-similar, Integratable). In an empirical study over a corpus of 2.7 million datasets, annotated with ground-truth pairs, it compares four approaches to identifying relationships: schema.org-based markup, heuristics, gradient-boosted trees, and LLM-based classification. The metadata-driven ML approaches reach about 90% multi-class classification accuracy and outperform the baselines. The study finds that at least 20% of datasets have a relationship with another dataset, highlights gaps in schema.org markup for many relationships, and argues for richer metadata and tooling to better capture provenance and context. Overall, the work sets a benchmark at scale for future research and publicly releases a large collection of dataset pages to advance dataset discovery and interoperability.

Abstract

The Web today has millions of datasets, and the number of datasets continues to grow at a rapid pace. These datasets are not standalone entities; rather, they are intricately connected through complex relationships. Semantic relationships between datasets provide critical insights for research and decision-making processes. In this paper, we study dataset relationships from the perspective of users who discover, use, and share datasets on the Web: what relationships are important for different tasks? What contextual information might users want to know? We first present a comprehensive taxonomy of relationships between datasets on the Web and map these relationships to user tasks performed during dataset discovery. We develop a series of methods to identify these relationships and compare their performance on a large corpus of datasets generated from Web pages with schema.org markup. We demonstrate that machine-learning based methods that use dataset metadata achieve multi-class classification accuracy of 90%. Finally, we highlight gaps in available semantic markup for datasets and discuss how incorporating comprehensive semantics can facilitate the identification of dataset relationships. By providing a comprehensive overview of dataset relationships at scale, this paper sets a benchmark for future research.

Paper Structure (31 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: The "Aquarius Official Release Level 3 Ancillary Reynolds Sea Surface Temperature Standard Mapped Image" dataset has annual, monthly, and daily variants with multiple versions. Variants are derived from each other and can have different replicas (e.g., "Annual V4" on three sites) and reconfigurations (ascending, descending).
  • Figure 2: The overall accuracy for each method type.