LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models

Sameer Sadruddin, Jennifer D'Souza, Eleni Poupaki, Alex Watkins, Hamed Babaei Giglou, Anisa Rula, Bora Karasulu, Sören Auer, Adrie Mackus, Erwin Kessels

TL;DR

This work tackles the challenge of extracting machine-actionable schemas from unstructured scientific text by introducing schema-miner, a human-in-the-loop framework that uses LLMs to generate, refine, and finalize domain schemas and then grounds them in ontologies. The approach decomposes schema discovery into four stages (initial design, expert-guided refinement, broad finalization, and ontology grounding) and demonstrates its practicality on a materials-science use case, atomic layer deposition (ALD), with iterative feedback from domain experts. Key contributions include a systematic, scalable methodology for LLM-assisted schema discovery, an adaptable human-in-the-loop workflow, and a concrete demonstration that yields AI-ready schemas suitable for knowledge graphs and automated reasoning. The study shows that combining automated extraction with expert oversight, across staged corpora and feedback modalities, improves semantic richness and generalizability, with potential impact on standardization and cross-domain integration across diverse scientific domains.

Abstract

Extracting structured information from unstructured text is crucial for modeling real-world processes, but traditional schema mining relies on semi-structured data, limiting scalability. This paper introduces schema-miner, a novel tool that combines large language models with human feedback to automate and refine schema extraction. Through an iterative workflow, it organizes properties from text, incorporates expert input, and integrates domain-specific ontologies for semantic depth. Applied to materials science, specifically atomic layer deposition, schema-miner demonstrates that expert-guided LLMs generate semantically rich schemas suitable for diverse real-world applications.

Paper Structure

This paper contains 19 sections, 3 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Overview of the LLMs4SchemaDiscovery workflow implemented in https://github.com/sciknoworg/schema-miner. Stage 1 (gray box) generates an initial schema from domain specifications. Stage 2 (orange box) refines the schema using a small, expert-curated set of papers and optional feedback. Stage 3 (red box) finalizes the schema with a larger, non-curated collection of papers. The workflow iteratively updates the schema and concludes by grounding schema properties to ontologies using an ontology lookup service API.
  • Figure 2: An ALD cycle consisting of two half-reactions: precursor addition and surface reaction, followed by co-reactant introduction, with purge phases ensuring clean, controlled growth [vos2019atomic].
  • Figure 3: A UML diagram of the best schema (https://orkg.org/template/R796110) from schema-miner.
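
The staged workflow described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the schema-miner implementation or its API: the function names, the prompt wording, the `llm` callable (any function mapping a prompt string to a JSON string), and the `lookup` callable (standing in for an ontology lookup service) are all assumptions made for illustration.

```python
import json
from typing import Callable, Iterable, Optional

# Hypothetical interfaces, not part of schema-miner:
LLM = Callable[[str], str]                 # prompt -> JSON text of a schema
Lookup = Callable[[str], Optional[str]]    # property name -> ontology term, if any


def initial_schema(llm: LLM, domain_spec: str) -> dict:
    """Stage 1: draft a schema from a domain specification."""
    prompt = f"Draft a JSON schema for this process specification:\n{domain_spec}"
    return json.loads(llm(prompt))


def refine(llm: LLM, schema: dict, papers: Iterable[str],
           feedback: Optional[str] = None) -> dict:
    """Stages 2 and 3: update the schema once per paper, optionally with feedback."""
    for paper in papers:
        prompt = (
            "Update the schema so it covers the properties in this paper.\n"
            f"Current schema:\n{json.dumps(schema)}\n"
            f"Paper:\n{paper}\n"
        )
        if feedback:
            prompt += f"Expert feedback:\n{feedback}\n"
        schema = json.loads(llm(prompt))
    return schema


def ground(schema: dict, lookup: Lookup) -> dict:
    """Final step: map each schema property to an ontology term (or None)."""
    return {name: lookup(name) for name in schema.get("properties", {})}


def run_workflow(llm: LLM, lookup: Lookup, domain_spec: str,
                 curated_papers: Iterable[str], corpus: Iterable[str],
                 feedback: Optional[str] = None) -> tuple:
    schema = initial_schema(llm, domain_spec)                 # Stage 1
    schema = refine(llm, schema, curated_papers, feedback)    # Stage 2: small curated set
    schema = refine(llm, schema, corpus)                      # Stage 3: larger, non-curated set
    return schema, ground(schema, lookup)                     # ontology grounding
```

Passing the model and the lookup service in as plain callables keeps the sketch independent of any particular LLM provider or ontology service; in practice the expert feedback in Stage 2 would be collected between iterations rather than fixed up front.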