Toward FAIR Semantic Publishing of Research Dataset Metadata in the Open Research Knowledge Graph

Raia Abu Ahmad, Jennifer D'Souza, Matthäus Zloch, Wolfgang Otto, Georg Rehm, Allard Oelen, Stefan Dietze, Sören Auer

TL;DR

The paper addresses the poor discoverability of research datasets due to insufficient structured metadata by introducing the ORKG-Dataset content type within the Open Research Knowledge Graph. It presents a standardized, FAIR-compliant semantic model that ties datasets to their accompanying publications using templates and RDF, demonstrated on 40 NLP information-extraction datasets. Key contributions include a dual-type design (Contribution and Dataset), a robust semantic representation of dataset facets (research problems, statistics, quality, benchmarks), and three customizable views (Bibliometric, Dataset, SOTA) to support precise discovery and comparison. The work enhances web-scale discoverability and reuse of research datasets by moving from unstructured textual descriptions to structured, interoperable knowledge graphs that reflect both dataset content and scholarly context.

Abstract

Search engines these days can serve datasets as search results. Datasets get picked up by search technologies based on structured descriptions on their official web pages, informed by metadata ontologies such as the Dataset content type of schema.org. Despite this promotion of the content type dataset as a first-class citizen of search results, a vast proportion of datasets, particularly research datasets, are still not discoverable and therefore remain largely unused. This is due to the sheer volume of datasets released every day and the inability of metadata to reflect a dataset's content and context accurately. This work seeks to improve this situation for a specific class of datasets, namely research datasets, which are the result of research endeavors and are accompanied by a scholarly publication. We propose the ORKG-Dataset content type, a specialized branch of the Open Research Knowledge Graph (ORKG) platform, which provides descriptive information and a semantic model for research datasets, integrating them with their accompanying scholarly publications. This work aims to establish a standardized framework for recording and reporting research datasets within the ORKG-Dataset content type. This, in turn, increases research dataset transparency on the web for their improved discoverability and applied use. In this paper, we present a proposal -- the minimum FAIR, comparable, semantic description of research datasets in terms of salient properties of their supporting publication. We design a specific application of the ORKG-Dataset semantic model based on 40 diverse research datasets on scientific information extraction.

Paper Structure (10 sections, 3 figures)

This paper contains 10 sections, 3 figures.

Figures (3)

  • Figure 1: Excerpt of a screenshot of research datasets addressing scientific IE in the ORKG comparison view with structured metadata descriptions based on a set of properties defined as the ORKG-Dataset content type. The full comparison of 40 research datasets is accessible at https://orkg.org/comparison/R280270/.
  • Figure 2: Example query to obtain a list of ground truth datasets and the tasks they address. Full query: https://tinyurl.com/query-example-1.
  • Figure 3: Example query to filter for datasets that label "Method" and "Research problem" as labeled entity types in the ground truth. Full query: https://tinyurl.com/query-example-2.
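
The two queries captioned in Figures 2 and 3 can be approximated in plain Python to show what kind of filtering structured dataset metadata makes possible. This is only a minimal sketch: the dataset names, the predicate labels (`addresses_task`, `labeled_entity_type`), and the toy triples are hypothetical placeholders, not ORKG identifiers or data from the paper, and the actual queries behind the figures are not reproduced here.

```python
# Toy in-memory triples (subject, predicate, object).
# All names below are illustrative assumptions, not real ORKG data.
TRIPLES = [
    ("dataset_a", "addresses_task", "named entity recognition"),
    ("dataset_a", "labeled_entity_type", "Method"),
    ("dataset_a", "labeled_entity_type", "Research problem"),
    ("dataset_b", "addresses_task", "relation extraction"),
    ("dataset_b", "labeled_entity_type", "Method"),
    ("dataset_c", "addresses_task", "named entity recognition"),
    ("dataset_c", "labeled_entity_type", "Research problem"),
    ("dataset_c", "labeled_entity_type", "Method"),
    ("dataset_c", "labeled_entity_type", "Material"),
]


def objects_of(triples, subject, predicate):
    """Return the set of objects for a given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def datasets_with_tasks(triples):
    """In the spirit of Figure 2: map each dataset to the tasks it addresses."""
    result = {}
    for s, p, o in triples:
        if p == "addresses_task":
            result.setdefault(s, set()).add(o)
    return result


def datasets_labeling_all(triples, required_types):
    """In the spirit of Figure 3: datasets whose labeled entity types
    include every type in required_types."""
    required = set(required_types)
    subjects = sorted({s for s, _, _ in triples})
    return [
        s for s in subjects
        if required <= objects_of(triples, s, "labeled_entity_type")
    ]


if __name__ == "__main__":
    print(datasets_with_tasks(TRIPLES))
    print(datasets_labeling_all(TRIPLES, ["Method", "Research problem"]))
```

The point of the sketch is that once entity types and tasks are recorded as separate, typed statements rather than free text, a filter such as "labels both Method and Research problem" reduces to a subset check over the statements about each dataset.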