LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models
Sameer Sadruddin, Jennifer D'Souza, Eleni Poupaki, Alex Watkins, Hamed Babaei Giglou, Anisa Rula, Bora Karasulu, Sören Auer, Adrie Mackus, Erwin Kessels
TL;DR
This work tackles the challenge of extracting machine-actionable schemas from unstructured scientific text by introducing schema-miner, a human-in-the-loop framework that uses LLMs to generate, refine, and finalize domain schemas, culminating in ontology-grounded representations. The approach decomposes schema discovery into four stages (initial design, expert-guided refinement, broad finalization, and ontology grounding) and demonstrates practicality through a materials-science use case on atomic layer deposition (ALD) with iterative domain expert feedback. Key contributions include a systematic, scalable methodology for LLM-assisted schema discovery, an adaptable human-in-the-loop workflow, and a concrete demonstration that yields AI-ready schemas suitable for knowledge graphs and automated reasoning. The study shows that combining automated extraction with expert oversight, across staged corpora and feedback modalities, improves semantic richness and generalizability, with potential impact across diverse scientific domains for standardization and cross-domain integration.
Abstract
Extracting structured information from unstructured text is crucial for modeling real-world processes, but traditional schema mining relies on semi-structured data, limiting scalability. This paper introduces schema-miner, a novel tool that combines large language models with human feedback to automate and refine schema extraction. Through an iterative workflow, it organizes properties from text, incorporates expert input, and integrates domain-specific ontologies for semantic depth. Applied to materials science, specifically atomic layer deposition, schema-miner demonstrates that expert-guided LLMs generate semantically rich schemas suitable for diverse real-world applications.
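
The staged workflow summarized above (initial design, expert-guided refinement, broad finalization, ontology grounding) can be pictured with a minimal Python sketch. This is not the schema-miner implementation or its API: every name here (`toy_llm`, `mine_schema`, `KNOWN_TERMS`, the `drop:` feedback format, the seed text input, and the ontology identifiers) is a hypothetical placeholder, and the "LLM" is a keyword matcher standing in for a real model call. The sketch only illustrates how a schema might be carried through the four stages.

```python
from typing import Callable, Dict, List, Optional

# Assumption: a schema is a flat mapping of property name -> metadata.
Schema = Dict[str, Dict[str, str]]
LLM = Callable[[Schema, str, Optional[str]], Schema]

# Hypothetical vocabulary the toy "LLM" can recognize in text.
KNOWN_TERMS = ("precursor", "temperature", "substrate", "growth rate")


def toy_llm(schema: Schema, text: str, feedback: Optional[str] = None) -> Schema:
    """Stand-in for a real LLM call: adds known terms found in `text`,
    then applies feedback of the (hypothetical) form 'drop: <property>'."""
    updated = {name: dict(meta) for name, meta in schema.items()}
    for term in KNOWN_TERMS:
        if term in text.lower():
            updated.setdefault(term, {"description": "mentioned in source text"})
    if feedback and feedback.startswith("drop:"):
        updated.pop(feedback[len("drop:"):].strip(), None)
    return updated


def mine_schema(
    llm: LLM,
    seed_text: str,
    curated: List[str],
    corpus: List[str],
    get_feedback: Callable[[Schema], Optional[str]],
    ontology: Dict[str, str],
) -> Schema:
    # Stage 1: initial design from a seed description (assumed input).
    schema = llm({}, seed_text, None)
    # Stage 2: expert-guided refinement on a small curated set.
    for paper in curated:
        schema = llm(schema, paper, get_feedback(schema))
    # Stage 3: broad finalization on a larger corpus, without feedback here.
    for paper in corpus:
        schema = llm(schema, paper, None)
    # Stage 4: ontology grounding by attaching identifiers where available.
    for name, meta in schema.items():
        if name in ontology:
            meta["ontology_id"] = ontology[name]
    return schema
```

Calling `mine_schema(toy_llm, seed_text, curated, corpus, get_feedback, ontology)` with a feedback function that returns a `drop:` instruction shows how an expert can remove a property the model proposed. A real system would replace `toy_llm` with prompts to an actual model and the dictionary lookup with a proper ontology-matching step.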
