Advancing Scientific Knowledge Retrieval and Reuse with a Novel Digital Library for Machine-Readable Knowledge

Hadi Ghaemi, Lauren Snyder, Markus Stocker

TL;DR

The paper addresses the limitation that current digital libraries are document-centric and not readily machine-readable, hindering synthesis-based reuse. It introduces ORKG reborn, a three-layer digital library that publishes machine-readable scientific knowledge as reborn articles with statements and supporting evidence linked to data and code. The architecture combines a Data Type Registry and RO-Crates for data deposition, Elasticsearch and Faiss for storage and search, and a hybrid retrieval approach with dense vectors, keyword search, and cross-encoder re-ranking. This approach improves transparency, reproducibility, and reuse, enabling novel information retrieval for synthesis and cross-domain knowledge integration; future work expands knowledge types and supports synthesis use cases.
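The hybrid retrieval step mentioned above (keyword search plus dense vectors, followed by re-ranking) can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: the summary does not say how the two result lists are fused, so reciprocal rank fusion is swapped in as one common choice; `keyword_score` stands in for an Elasticsearch query, `cosine` over toy two-dimensional vectors stands in for a Faiss index, and `term_overlap` stands in for a cross-encoder model. The statement texts and vectors are invented examples, not data from the paper.

```python
import math
from collections import Counter


def keyword_score(query, text):
    """Count occurrences of query terms in the text (stand-in for a keyword index)."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query.lower().split())


def cosine(u, v):
    """Cosine similarity between two vectors (stand-in for a dense vector index)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(scores):
    """Return ids ordered by descending score; ties are broken by id."""
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings: each list contributes 1 / (k + position) per id."""
    fused = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + position)
    return rank(fused)


def term_overlap(query, text):
    """Toy re-ranker: fraction of query terms found in the text.

    Assumption: a real system would score (query, text) pairs with a
    cross-encoder model instead.
    """
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(term in words for term in terms) / len(terms)


def hybrid_search(query, query_vec, docs, rerank_fn=term_overlap, top_k=2):
    """Keyword and dense retrieval, fused, then re-ranked over the top candidates."""
    keyword = {d: keyword_score(query, doc["text"]) for d, doc in docs.items()}
    dense = {d: cosine(query_vec, doc["vec"]) for d, doc in docs.items()}
    candidates = reciprocal_rank_fusion([rank(keyword), rank(dense)])[:top_k]
    # sorted() is stable, so candidates with equal re-rank scores keep fused order.
    return sorted(candidates, key=lambda d: -rerank_fn(query, docs[d]["text"]))


# Hypothetical statements with hand-made 2-d "embeddings".
STATEMENTS = {
    "s1": {"text": "cover crops increase soil carbon", "vec": [1.0, 0.0]},
    "s2": {"text": "transformer models for text classification", "vec": [0.0, 1.0]},
    "s3": {"text": "soil moisture measured with sensors", "vec": [0.8, 0.2]},
}

results = hybrid_search("soil carbon", [1.0, 0.1], STATEMENTS)
print(results)
```

Fusing ranks rather than raw scores avoids having to put keyword counts and cosine similarities on a common scale, which is why it is used in this sketch; a weighted sum of normalized scores would be an equally plausible alternative.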

Abstract

Digital libraries for research, such as the ACM Digital Library or Semantic Scholar, do not enable the machine-supported, efficient reuse of scientific knowledge (e.g., in synthesis research). This is because these libraries are based on document-centric models with narrative text knowledge expressions that require manual or semi-automated knowledge extraction, structuring, and organization. We present ORKG reborn, an emerging digital library that supports finding, accessing, and reusing accurate, fine-grained, and reproducible machine-readable expressions of scientific knowledge that relate scientific statements and their supporting evidence in terms of data and code. The rich expressions of scientific knowledge are published as reborn (born-reusable) articles and provide novel possibilities for scientific knowledge retrieval, for instance by statistical methods, software packages, variables, or data matching specific constraints. We describe the proposed system and demonstrate its practical viability and potential for information retrieval in contrast to state-of-the-art digital libraries and document-centric scholarly communication using several published articles in research fields ranging from computer science to soil science. Our work underscores the enormous potential of scientific knowledge databases and a viable approach to their construction.

Paper Structure

This paper contains 9 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Proposed system architecture showing the three main layers: Data Deposition and Collection Layer, Knowledge Organization Layer, and Presentation Layer.
  • Figure 2: Diagram of the 'Data Preprocessing' data type describing the executed procedure, the utilized input data, and produced output data.
  • Figure 3: Overview of the RO-Crate metadata file structure.
  • Figure 4: Scientific statements and supporting evidence as originally published by Gentsch et al. [gentsch2024cover], presented here as a reborn article accessible in the ORKG reborn digital library. (left) A reborn article presenting the original research findings as structured scientific statements and supporting evidence (https://doi.org/10.48366/5eqe8313). (right) Display of a scientific statement and supporting evidence in terms of a data analysis described by the executed procedure, utilized input data, produced output data, and full implementation in source code.
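
To make the captions of Figures 2 and 3 more concrete, the following is a minimal sketch of how a 'Data Preprocessing' step (executed procedure, input data, output data) might be recorded in an RO-Crate metadata file. It is not the layout used by ORKG reborn, which this summary does not show. The sketch assumes the generic RO-Crate 1.1 conventions: a `ro-crate-metadata.json` descriptor, a root `Dataset` with `@id` `./`, and a schema.org `CreateAction` whose `instrument`, `object`, and `result` point to the script, the input, and the output. The function name `build_crate` and all file names are hypothetical, and root-dataset properties that the specification expects (for example a license and a publication date) are omitted for brevity.

```python
import json


def build_crate(script, input_file, output_file):
    """Build a minimal RO-Crate-style metadata document as a Python dict."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # Metadata descriptor: identifies this file and the root dataset.
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # Root dataset listing the files packaged in the crate.
                "@id": "./",
                "@type": "Dataset",
                "name": "Example analysis crate (hypothetical)",
                "hasPart": [
                    {"@id": script},
                    {"@id": input_file},
                    {"@id": output_file},
                ],
            },
            {"@id": script, "@type": ["File", "SoftwareSourceCode"]},
            {"@id": input_file, "@type": "File"},
            {"@id": output_file, "@type": "File"},
            {   # Assumption: the preprocessing step is modeled as a CreateAction.
                "@id": "#data-preprocessing",
                "@type": "CreateAction",
                "name": "Data Preprocessing",
                "instrument": {"@id": script},    # executed procedure
                "object": {"@id": input_file},    # utilized input data
                "result": {"@id": output_file},   # produced output data
            },
        ],
    }


crate = build_crate("preprocess.py", "raw.csv", "clean.csv")
print(json.dumps(crate, indent=2))
```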