Relationships are Complicated! An Analysis of Relationships Between Datasets on the Web
Kate Lin, Tarfah Alrashed, Natasha Noy
TL;DR
The paper studies how datasets published on the Web relate to one another, arguing that these relationships are crucial for discovery, reproducibility, and trust. It grounds a taxonomy of relationships in user tasks and formal definitions, distinguishing provenance-based relationships (Replica, Version, Subset, Derivation, Variant) from non-provenance-based ones (Topically Similar, Task-Similar, Integratable). In an empirical study over a corpus of 2.7 million datasets with annotated ground-truth pairs, it compares four approaches to identifying relationships: schema.org markup, heuristics, gradient-boosted trees, and LLM-based classification. Metadata-driven machine-learning methods reach about 90% multi-class classification accuracy and outperform the baselines. The study finds that at least 20% of datasets have a relationship with another dataset, shows that schema.org markup does not cover many of these relationships, and argues for richer metadata and tooling to capture provenance and context. Overall, the work sets a benchmark at scale for future research and publicly releases a large collection of dataset pages to advance dataset discovery and interoperability.
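As an illustration only, the two-level taxonomy summarized above can be written down as a small data structure. The sketch below is not the authors' implementation: the names `Relationship`, `PROVENANCE_BASED`, `is_provenance_based`, and `group_by_kind` are hypothetical, and the only thing taken from the text is the list of relationship types and their grouping.

```python
from enum import Enum


class Relationship(Enum):
    """Relationship types listed in the summary (labels are illustrative)."""

    # Provenance-based relationships
    REPLICA = "replica"
    VERSION = "version"
    SUBSET = "subset"
    DERIVATION = "derivation"
    VARIANT = "variant"
    # Non-provenance-based relationships
    TOPICALLY_SIMILAR = "topically_similar"
    TASK_SIMILAR = "task_similar"
    INTEGRATABLE = "integratable"


# Hypothetical grouping constant; membership follows the summary's split.
PROVENANCE_BASED = frozenset(
    {
        Relationship.REPLICA,
        Relationship.VERSION,
        Relationship.SUBSET,
        Relationship.DERIVATION,
        Relationship.VARIANT,
    }
)


def is_provenance_based(rel: Relationship) -> bool:
    """Return True if the relationship belongs to the provenance-based group."""
    return rel in PROVENANCE_BASED


def group_by_kind() -> dict[str, list[str]]:
    """Split all relationship labels into the two top-level groups."""
    groups: dict[str, list[str]] = {"provenance": [], "non_provenance": []}
    for rel in Relationship:
        key = "provenance" if is_provenance_based(rel) else "non_provenance"
        groups[key].append(rel.value)
    return groups
```

A classifier over dataset pairs, such as the gradient-boosted trees mentioned in the summary, could use an enumeration like this as its label set. The features and training setup are not described here and are not assumed.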
Abstract
The Web today has millions of datasets, and the number of datasets continues to grow at a rapid pace. These datasets are not standalone entities; rather, they are intricately connected through complex relationships. Semantic relationships between datasets provide critical insights for research and decision-making processes. In this paper, we study dataset relationships from the perspective of users who discover, use, and share datasets on the Web: what relationships are important for different tasks? What contextual information might users want to know? We first present a comprehensive taxonomy of relationships between datasets on the Web and map these relationships to user tasks performed during dataset discovery. We develop a series of methods to identify these relationships and compare their performance on a large corpus of datasets generated from Web pages with schema.org markup. We demonstrate that machine-learning-based methods that use dataset metadata achieve multi-class classification accuracy of 90%. Finally, we highlight gaps in available semantic markup for datasets and discuss how incorporating comprehensive semantics can facilitate the identification of dataset relationships. By providing a comprehensive overview of dataset relationships at scale, this paper sets a benchmark for future research.
