AutoFAIR: Automatic Data FAIRification via Machine Reading

Tingyan Ma, Wei Liu, Bin Lu, Xiaoying Gan, Yunqiang Zhu, Luoyi Fu, Chenghu Zhou

TL;DR

This work presents AutoFAIR, an automated architecture that FAIRifies data by linking data and metadata operations to FAIR indicators. It combines a two-stage Web Reader (DOM-based GNN node classification followed by LM-driven extraction) with FAIR Alignment (ontology guidance and semantic matching) to produce machine-readable, standards-aligned metadata. A case study in mountain hazards shows substantial improvements in Findability, Accessibility, Interoperability, and Reusability, including the generation of spatiotemporal maps and searchable metadata profiles. An evaluation of 7124 datasets across 512 domains demonstrates cross-domain applicability and scalable automation for data sharing and reuse, although the results depend on the information richness of the source webpages. The approach enhances data discovery and reuse in practice and provides a blueprint for broader automated FAIRification across scientific domains.

Abstract

The explosive growth of data fuels data-driven research, facilitating progress across diverse domains. The FAIR principles have emerged as a guiding standard, aiming to enhance the findability, accessibility, interoperability, and reusability of data. However, current efforts primarily focus on manual data FAIRification, which can only handle targeted data and lacks efficiency. To address this issue, we propose AutoFAIR, an architecture designed to enhance data FAIRness automatically. First, we align each data and metadata operation with specific FAIR indicators to guide machine-executable actions. Then, we use Web Reader to automatically extract metadata with language models, even when data webpages lack a structured schema. Subsequently, FAIR Alignment makes the metadata comply with the FAIR principles through ontology guidance and semantic matching. Finally, by applying AutoFAIR to various data, especially in the field of mountain hazards, we observe significant improvements in the findability, accessibility, interoperability, and reusability of data. The FAIRness scores before and after applying AutoFAIR indicate enhanced data value.
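
To make the FAIR Alignment step more concrete, the following is a minimal sketch of mapping extracted field labels onto a target metadata vocabulary. It is an illustration under stated assumptions, not the paper's implementation: the `TARGET_FIELDS` vocabulary, the function names, and the 0.75 threshold are all hypothetical, and plain string similarity from Python's `difflib` stands in for the ontology-guided semantic matching described above.

```python
from difflib import SequenceMatcher

# Hypothetical target vocabulary (not from the paper):
# canonical field name -> label variants seen on data webpages.
TARGET_FIELDS = {
    "title": ["title", "dataset name", "name"],
    "creator": ["creator", "author", "contributor", "data provider"],
    "license": ["license", "licence", "usage rights", "terms of use"],
    "temporal_coverage": ["temporal coverage", "time range", "time period"],
    "spatial_coverage": ["spatial coverage", "location", "study area"],
}


def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for semantic matching."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def align_field(raw_label, threshold=0.75):
    """Return the canonical field closest to raw_label, or None if too dissimilar."""
    best_field, best_score = None, 0.0
    for field, variants in TARGET_FIELDS.items():
        for variant in variants:
            score = similarity(raw_label, variant)
            if score > best_score:
                best_field, best_score = field, score
    return best_field if best_score >= threshold else None


def align_metadata(extracted):
    """Split extracted {label: value} pairs into an aligned profile and leftovers."""
    profile, unmatched = {}, {}
    for label, value in extracted.items():
        field = align_field(label)
        if field is None:
            unmatched[label] = value
        else:
            # Keep the first value seen for each canonical field.
            profile.setdefault(field, value)
    return profile, unmatched
```

Calling `align_metadata({"Author": "A. Example", "zzz": "?"})` places the first pair under `creator` and leaves the second in the unmatched dictionary. A real system would likely replace `similarity` with embedding-based or ontology-aware matching; the surrounding control flow would stay much the same.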

Paper Structure (17 sections, 6 equations, 6 figures, 4 tables)

This paper contains 17 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of AutoFAIR’s Architecture. The DOM tree is constructed from the data webpage HTML. In Web Reader, nodes are categorized by a graph neural network to locate metadata fields, and for nodes with long text, a language model extracts the metadata. The extracted fields are then mapped according to the FAIR principles through FAIR Alignment, resulting in a FAIR-compliant metadata profile.
  • Figure 2: FAIRness scores for data in four domains before and after AutoFAIR. (a) and (b) have type 2 metadata, i.e., metadata nested in the HTML structure, where FAIRness is mainly enhanced by node-wise classifier extraction; (c) and (d) have type 3 metadata, where element-wise extraction is required in addition to node-wise classification to enhance FAIRness.
  • Figure 3: Web Reader fully extracts spatiotemporal information and institutions from this descriptive text embedded in the data webpage.
  • Figure 4: FAIR-compliant metadata enables spatial and temporal map search.
  • Figure 5: Spatial Distribution of Data on "Collapse". There is more open data in Europe, America and China.
  • ...and 1 more figures
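
The two-stage Web Reader flow described in the Figure 1 caption can be sketched as follows. This is a simplified illustration, not the authors' code: a rule-based `classify_node` stands in for the graph neural network, a year regex in `extract_years` stands in for the language model, and every class and function name here is hypothetical. Only the overall shape (build a DOM tree, classify nodes, run element-wise extraction on long-text nodes) follows the caption.

```python
import re
from html.parser import HTMLParser

# Tags without a closing tag; they never become the "current" node.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Builds a simple DOM tree from well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.text = (self.current.text + " " + text).strip()


def walk(node):
    """Depth-first traversal of the DOM tree."""
    yield node
    for child in node.children:
        yield from walk(child)


def classify_node(node):
    """Rule-based stand-in for the GNN node classifier (assumption)."""
    if node.tag == "td" and node.parent is not None:
        # Use the neighbouring header cell as the field label.
        for sibling in node.parent.children:
            if sibling.tag == "th" and sibling.text:
                return sibling.text.lower()
    if node.tag == "p" and len(node.text) > 40:
        return "long_text"
    return "other"


def extract_years(text):
    """Regex stand-in for LM-driven element-wise extraction (assumption)."""
    return re.findall(r"\b(?:19|20)\d{2}\b", text)


def read_page(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    fields = {}
    for node in walk(builder.root):
        label = classify_node(node)
        if label == "long_text":
            fields["years"] = extract_years(node.text)
        elif label != "other":
            fields[label] = node.text
    return fields


SAMPLE_HTML = """
<div><table>
<tr><th>Title</th><td>Landslide inventory</td></tr>
<tr><th>License</th><td>CC BY 4.0</td></tr>
</table>
<p>Observations were collected between 2008 and 2015 in Sichuan.</p></div>
"""
```

Running `read_page(SAMPLE_HTML)` on this made-up page yields the title and license from the table cells (node-wise classification) and the two years from the paragraph (element-wise extraction). This mirrors the distinction the Figure 2 caption draws between metadata nested in the HTML structure and metadata that must be pulled out of descriptive text.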