The Science Data Lake: A Unified Open Infrastructure Integrating 293 Million Papers Across Eight Scholarly Sources with Embedding-Based Ontology Alignment

Jonas Wilinski

TL;DR

The Science Data Lake is a locally deployable infrastructure built on DuckDB and plain Parquet files. It unifies eight open sources - Semantic Scholar, OpenAlex, SciSciNet, Papers with Code, Retraction Watch, Reliance on Science, a preprint-to-published mapping, and Crossref - via DOI normalization while preserving source-level schemas.

Abstract

Scholarly data are largely fragmented across siloed databases with divergent metadata and missing linkages among them. We present the Science Data Lake, a locally-deployable infrastructure built on DuckDB and simple Parquet files that unifies eight open sources - Semantic Scholar, OpenAlex, SciSciNet, Papers with Code, Retraction Watch, Reliance on Science, a preprint-to-published mapping, and Crossref - via DOI normalization while preserving source-level schemas. The resource comprises approximately 960GB of Parquet files spanning ~293 million uniquely identifiable papers across ~22 schemas and ~153 SQL views. An embedding-based ontology alignment using BGE-large sentence embeddings maps 4,516 OpenAlex topics to 13 scientific ontologies (~1.3 million terms), yielding 16,150 mappings covering 99.8% of topics ($\geq 0.65$ threshold) with $F1 = 0.77$ at the recommended $\geq 0.85$ operating point, outperforming TF-IDF, BM25, and Jaro-Winkler baselines on a 300-pair gold-standard evaluation. We validate through 10 automated checks, cross-source citation agreement analysis (pairwise Pearson $r = 0.76$ - $0.87$), and stratified manual annotation. Four vignettes demonstrate cross-source analyses infeasible with any single database. The resource is open source, deployable on a single drive or queryable remotely via HuggingFace, and includes structured documentation suitable for large language model (LLM) based research agents.
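The abstract names DOI normalization as the linking key but does not list the rules here. The following is a minimal sketch of how such linkage might work, not the paper's implementation. The prefix patterns, the lowercase rule, and the names `normalize_doi` and `link_by_doi` are assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, Optional, Set, Tuple

# ASSUMPTION: common DOI prefixes to strip; the paper's actual rules may differ.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Return a lowercase, prefix-free DOI, or None if the value does not look like one."""
    if raw is None:
        return None
    doi = _DOI_PREFIX.sub("", raw.strip()).lower()
    # DOIs start with the "10." directory indicator and contain a "/" before the suffix.
    return doi if doi.startswith("10.") and "/" in doi else None


def link_by_doi(records: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, Set[str]]:
    """Map each normalized DOI to the set of sources that contain it (hypothetical helper)."""
    index: Dict[str, Set[str]] = {}
    for source, raw in records:
        doi = normalize_doi(raw)
        if doi is not None:
            index.setdefault(doi, set()).add(source)
    return index


if __name__ == "__main__":
    sample = [
        ("openalex", "https://doi.org/10.1000/ABC"),
        ("s2ag", "10.1000/abc"),
        ("crossref", "doi:10.1000/Abc"),
        ("pwc", None),
    ]
    print(link_by_doi(sample))
```

In this sketch, three differently formatted strings for the same DOI collapse to one key, so a record can be matched across sources while each source keeps its own schema.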

Paper Structure (22 sections, 8 figures, 7 tables)

Background & Summary
Methods
Data Sources
Architecture
DOI Normalization and Record Linkage
Embedding-Based Ontology Alignment
Data Records
Technical Validation
DOI and Schema Integrity
Cross-Source Citation Agreement
Ontology Alignment Validation
Temporal Coverage Guardrails
Known Limitations
Usage Notes
Setup
...and 7 more sections

Figures (8)

Figure 1: Temporal coverage by source (symlog scale). Publication-year distributions for DOI-linked records in the unified index (see Table \ref{tab:sources} for full source sizes). The symlog y-axis is linear near zero and logarithmic above, revealing both historical depth and recent growth. OpenAlex, S2AG, and SciSciNet extend back to 1900 with tens of thousands of papers per year, while all three show clear acceleration after $\sim$1960. SciSciNet exhibits a sharp cutoff after $\sim$2022 when its metrics computation ends. The specialized sources (PWC, Retraction Watch) cover narrow temporal windows concentrated in the last two decades.
Figure 2: Architecture of the Science Data Lake. Eight open scholarly data sources (left) are converted to Apache Parquet format ($\sim$960 GB) and exposed as SQL views through a lightweight DuckDB database (center). Each source retains its native schema for source-level fidelity. The cross-referencing xref schema (orange) links records via DOI normalization (unified_papers, 293M rows) and connects OpenAlex topics to 13 scientific ontologies (right) through BGE-large embedding-based alignment.
Figure 3: UpSet plot showing the intersection structure across six data sources. Bars represent the number of papers in each source combination. Of 34 observed combinations, the three-way overlap of OpenAlex, SciSciNet, and S2AG accounts for the largest multi-source intersection.
Figure 4: UMAP projection of BGE-large embeddings for OpenAlex topics (points) and matched ontology terms (crosses), colored by OpenAlex domain. Semantic clusters emerge naturally, with domain-specific ontology terms co-locating with their corresponding topics.
Figure 5: Ontology reach heatmap showing the number of high-quality mappings ($\text{similarity} \geq 0.85$) between each ontology and each OpenAlex domain. The multi-ontology design ensures coverage across all scientific areas.
...and 3 more figures
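The captions and abstract refer to two similarity cut-offs for the ontology alignment: $\geq 0.65$ for inclusion and $\geq 0.85$ as the recommended operating point. The sketch below shows one way such thresholds could be applied to embedding pairs. It assumes the similarity is cosine similarity between embedding vectors, which this summary does not state. The tier labels and function names are hypothetical.

```python
import math
from typing import Sequence

INCLUDE_THRESHOLD = 0.65      # inclusion threshold quoted in the abstract
RECOMMENDED_THRESHOLD = 0.85  # recommended operating point quoted in the abstract


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_mapping(similarity: float) -> str:
    """Assign a hypothetical tier label to a topic-to-term similarity score."""
    if similarity >= RECOMMENDED_THRESHOLD:
        return "high"
    if similarity >= INCLUDE_THRESHOLD:
        return "candidate"
    return "rejected"


if __name__ == "__main__":
    # Toy 3-dimensional vectors stand in for real sentence embeddings.
    topic = [0.9, 0.1, 0.0]
    term = [0.8, 0.2, 0.1]
    score = cosine_similarity(topic, term)
    print(round(score, 3), classify_mapping(score))
```

Keeping the two thresholds as named constants makes the trade-off explicit: the lower one favors coverage, the higher one favors precision.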
