AutoFAIR: Automatic Data FAIRification via Machine Reading

Tingyan Ma; Wei Liu; Bin Lu; Xiaoying Gan; Yunqiang Zhu; Luoyi Fu; Chenghu Zhou

TL;DR

This work presents AutoFAIR, an automated architecture to FAIRify data by linking data/metadata operations to FAIR indicators and employing a two-stage Web Reader (DOM-based GNN node classification and LM-driven extraction) together with FAIR Alignment (ontology guidance and semantic matching) to produce machine-readable, standards-aligned metadata. A case study in mountain hazards shows substantial improvements in Findability, Accessibility, Interoperability, and Reusability, including the generation of spatiotemporal maps and searchable metadata profiles. By evaluating 7124 datasets across 512 domains, AutoFAIR demonstrates cross-domain applicability and scalable automation for data sharing and reuse, while acknowledging dependence on the source webpages' information richness. The approach enhances data discovery and reuse in practice and provides a blueprint for broader automated FAIRification across scientific domains.
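The first step of the Web Reader, turning a data webpage into a DOM tree whose nodes can later be classified, might look roughly like the sketch below. It uses only Python's standard `html.parser`; the class name `DomGraphBuilder`, the node dictionary layout, and the sample HTML are assumptions made for illustration, not the paper's implementation, and the GNN classifier itself is not shown.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "input", "meta", "link", "hr"}


class DomGraphBuilder(HTMLParser):
    """Collects DOM nodes and parent -> child edges from raw HTML (illustrative)."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # each node: {"tag": str, "text": str}
        self.edges = []   # (parent_index, child_index) pairs
        self._stack = []  # indices of the currently open elements

    def handle_starttag(self, tag, attrs):
        idx = len(self.nodes)
        self.nodes.append({"tag": tag, "text": ""})
        if self._stack:
            self.edges.append((self._stack[-1], idx))
        if tag not in VOID_TAGS:
            self._stack.append(idx)

    def handle_endtag(self, tag):
        # Close the most recent open element with this tag name.
        for pos in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[pos]]["tag"] == tag:
                del self._stack[pos:]
                break

    def handle_data(self, data):
        # Attach visible text to the innermost open element.
        text = data.strip()
        if text and self._stack:
            node = self.nodes[self._stack[-1]]
            node["text"] = (node["text"] + " " + text).strip()


def build_dom_graph(html):
    """Return (nodes, edges) for the given HTML string."""
    builder = DomGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.edges


# Made-up sample page fragment, for illustration only.
html = "<div><h1>Landslide inventory</h1><p>Region: <span>Sichuan</span></p></div>"
nodes, edges = build_dom_graph(html)
```

The resulting `nodes` and `edges` form a plain graph representation; a node classifier would additionally need feature vectors per node (for example from tag names and text), which this sketch leaves out.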

Abstract

The explosive growth of data fuels data-driven research, facilitating progress across diverse domains. The FAIR principles have emerged as a guiding standard, aiming to enhance the findability, accessibility, interoperability, and reusability of data. However, current efforts primarily focus on manual data FAIRification, which can only handle targeted data and lacks efficiency. To address this issue, we propose AutoFAIR, an architecture designed to enhance data FAIRness automatically. First, we align each data and metadata operation with specific FAIR indicators to guide machine-executable actions. Then, we use Web Reader to automatically extract metadata with language models, even in the absence of structured data webpage schemas. Subsequently, FAIR Alignment makes the metadata comply with the FAIR principles through ontology guidance and semantic matching. Finally, by applying AutoFAIR to various data, especially in the field of mountain hazards, we observe significant improvements in the findability, accessibility, interoperability, and reusability of data. The FAIRness scores before and after applying AutoFAIR indicate enhanced data value.
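To make the semantic-matching part of FAIR Alignment more concrete, here is a minimal sketch that maps extracted field labels onto a small target vocabulary. Several things are assumptions rather than details from the paper: the target vocabulary (a handful of schema.org `Dataset` property names with hand-written aliases), the 0.6 threshold, and the use of `difflib` string similarity as a simple stand-in for whatever semantic similarity the authors use. The sample values are made up.

```python
from difflib import SequenceMatcher

# Hypothetical target vocabulary: standard property name -> illustrative aliases.
TARGET_FIELDS = {
    "name": ["title", "dataset name"],
    "creator": ["author", "institution", "data provider"],
    "spatialCoverage": ["location", "study area", "region"],
    "temporalCoverage": ["time range", "period", "collection date"],
    "license": ["licence", "terms of use"],
}


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in for semantic similarity)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align_field(label, threshold=0.6):
    """Return the best-matching target field for a label, or None if below threshold."""
    best_field, best_score = None, 0.0
    for field, aliases in TARGET_FIELDS.items():
        for candidate in [field] + aliases:
            score = similarity(label, candidate)
            if score > best_score:
                best_field, best_score = field, score
    return best_field if best_score >= threshold else None


def align_metadata(extracted):
    """Keep only the extracted fields that map onto the target vocabulary."""
    profile = {}
    for label, value in extracted.items():
        field = align_field(label)
        if field is not None:
            profile[field] = value
    return profile


# Made-up extracted fields, for illustration only.
extracted = {
    "Title": "Debris flow records",
    "Study Area": "Hengduan Mountains",
    "Licence": "CC BY 4.0",
    "Page views": "1024",
}
profile = align_metadata(extracted)
```

In this example the three metadata labels are mapped to `name`, `spatialCoverage`, and `license`, while the unrelated "Page views" label falls below the threshold and is dropped. An embedding-based similarity could be swapped into `similarity` without changing the rest of the sketch.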

Paper Structure (17 sections, 6 equations, 6 figures, 4 tables)

Introduction
Related Work
Preliminary
Method
Overall Architecture
Web Reader
Node-wise Classifier
Element-wise Extractor
FAIR Alignment
Theoretical Analysis of FAIR Principles' Implementation in AutoFAIR
Data FAIRness Analysis
Datasets
The Impact of FAIRification on Dataset FAIRness
Findable and Accessible
Interoperable and Reusable
...and 2 more sections

Figures (6)

Figure 1: Overview of AutoFAIR’s Architecture. The DOM tree is constructed from the data webpage HTML. In Web Reader, nodes are categorized by a graph neural network to locate metadata fields, and for nodes with long text, a language model extracts the metadata. The extracted fields are then mapped according to the FAIR principles through FAIR Alignment, resulting in a FAIR-compliant metadata profile.
Figure 2: FAIRness scores for data in four domains before and after AutoFAIR. (a) and (b) have type 2 metadata, i.e., metadata nested in the HTML structure, and FAIRness is mainly enhanced by node-wise classifier extraction; (c) and (d) have type 3 metadata, which requires element-wise extraction in addition to node-wise classification to enhance FAIRness.
Figure 3: For this descriptive text embedded in the data webpage, spatiotemporal information and institutions can be fully extracted by Web Reader.
Figure 4: FAIR-compliant metadata enables spatial and temporal map search.
Figure 5: Spatial Distribution of Data on "Collapse". There is more open data in Europe, America and China.
...and 1 more figures
