MaterioMiner -- An ontology-based text mining dataset for extraction of process-structure-property entities

Ali Riza Durmaz, Akhil Thomas, Lokesh Mishra, Rachana Niranjan Murthy, Thomas Straub

TL;DR

MaterioMiner addresses the lack of datasets that couple ontologies with text in materials science by linking a materials mechanics ontology to publications from the fatigue domain. The paper introduces an annotated dataset of 2191 entities across 4 papers and 179 fine-grained classes, together with a curation workflow that yields two NER benchmarks (FG-NER and CG-NER). It demonstrates the feasibility of fine-tuning domain-specific language models (MatSciBERT) for NER on these tasks and outlines a workflow to extend ontologies and construct knowledge graphs from literature. Combining fine-grained ontological semantics with text corpora supports ontology-driven extraction of composition-process-microstructure-property (CPMP) relationships and improved interpretability, with potential to improve linked data and reasoning in materials science and engineering (MSE). The dataset and ontology, released under CC-BY 4.0, provide a resource for training materials language models, automated ontology construction, and knowledge-graph generation.

Abstract

While large language models learn sound statistical representations of the language and information therein, ontologies are symbolic knowledge representations that can complement the former ideally. Research at this critical intersection relies on datasets that intertwine ontologies and text corpora to enable training and comprehensive benchmarking of neurosymbolic models. We present the MaterioMiner dataset and the linked materials mechanics ontology where ontological concepts from the mechanics of materials domain are associated with textual entities within the literature corpus. Another distinctive feature of the dataset is its eminently fine-granular annotation. Specifically, 179 distinct classes are manually annotated by three raters within four publications, amounting to a total of 2191 entities that were annotated and curated. Conceptual work is presented for the symbolic representation of causal composition-process-microstructure-property relationships. We explore the annotation consistency between the three raters and perform fine-tuning of pre-trained models to showcase the feasibility of named-entity recognition model training. Reusing the dataset can foster training and benchmarking of materials language models, automated ontology construction, and knowledge graph generation from textual data.

Paper Structure (2 sections, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Schematic showing the methodology used to generate the presented dataset. The ontology was refined manually in an iterative fashion to permit thorough annotation of materials science-related scholarly articles.
  • Figure 2: a) The top-level structure of the proposed materials mechanics ontology covers the most common entities in materials science. b) General concepts capturing composition-processing-microstructure-property relationships. Subclass relations are displayed in black while other object properties are assigned colors. Ontology prefixes are discarded for clear visual display. Instead, classes to which an equivalent class exists in PMDco are visualized with an orange background. In such cases, an owl:equivalent relation was used to express equivalence.
  • Figure 3: The hierarchy for defects is shown using point, line, planar, surface, and volume defects as superclasses for common crystallographic defects. Specific configurations such as Frank-Read source defects, i.e., pinned dislocations that can cause dislocation multiplication through the Frank-Read source mechanism, are also considered defects. A text span 'dislocation' is annotated with the equivalently named ontology class and additionally propagated upwards to the mm:Defect class using the mm:Dislocation $\xrightarrow[]{\text{isA}}$ mm:LineDefect $\xrightarrow[]{\text{isA}}$ mm:Defect triples.
  • Figure 4: The figure shows the ontological modeling of damage including some contextual information. The model covers concepts ranging from local small-scale plasticity to macroscopic cracks and the relations between them. For instance, the fact that slip bands typically are aligned with slip planes is described. Some relations indicate the evolution of cracks from microscopically short over physically short to long cracks, which can be distinguished by the active cracking mode and the plastic zone surrounding the crack. This is modeled after mcdowell2010microstructure.
  • Figure 5: Annotation and curation of a sample sentence. All shown annotations are at the entity level. The green, violet, and pink colors in the annotation boxes indicate accordance, deviations, and curated annotations, respectively.
  • ...and 2 more figures
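
The upward propagation described in the Figure 3 caption (a 'dislocation' annotation also counting as mm:LineDefect and mm:Defect) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the `SUBCLASS_OF` mapping contains only the two isA triples named in the caption, the function names are hypothetical, and the sketch assumes each class has at most one direct superclass.

```python
# Hypothetical subset of the ontology: only the two isA triples
# quoted in the Figure 3 caption. Assumes single inheritance.
SUBCLASS_OF = {
    "mm:Dislocation": "mm:LineDefect",
    "mm:LineDefect": "mm:Defect",
}


def superclass_chain(cls, subclass_of=SUBCLASS_OF):
    """Return cls followed by all of its ancestors, most specific first."""
    chain = [cls]
    seen = {cls}
    while chain[-1] in subclass_of:
        parent = subclass_of[chain[-1]]
        if parent in seen:
            # Guard against a malformed (cyclic) hierarchy.
            raise ValueError(f"cycle detected at {parent}")
        chain.append(parent)
        seen.add(parent)
    return chain


def propagate(text, cls, subclass_of=SUBCLASS_OF):
    """Expand one (text, class) annotation to every class on the isA chain."""
    return [(text, c) for c in superclass_chain(cls, subclass_of)]


if __name__ == "__main__":
    for span, label in propagate("dislocation", "mm:Dislocation"):
        print(span, "->", label)
```

Stopping the chain at a chosen depth instead of following it to the root would give a coarser label for the same text span. Whether the coarse-grained benchmark is derived this way is an assumption and is not stated in this summary.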