Reconstructing Materials Tetrahedron: Challenges in Materials Information Extraction

Kausik Hira, Mohd Zaki, Dhruvil Sheth, Mausam, N M Anoop Krishnan

TL;DR

This work identifies and quantifies the information extraction (IE) challenges in materials science (MatSci) literature that arise when reconstructing the materials tetrahedron (composition, structure, properties, processing, and testing) as a large materials knowledge base (KB). It combines a corpus of 2536 papers with manual annotations and domain-specific IE models (DiSCoMaT for tables and GPT-4 for text) to quantify where information appears (text vs. tables) and which table and text formats hinder extraction. Key findings are a heavy reliance on tables for compositions and properties; frequent variability in table structure (single-cell vs. multi-cell composition, SCC/MCC; complete vs. partial information, CI/PI) and in reporting units; and difficulty linking information across multiple tables and the text. The study offers guidelines for IE-friendly table design and argues that integrated pipelines for joint table and text extraction, together with robust inter-table linking, are needed to turn the literature into a usable, scalable MatSci KB for accelerated materials discovery.

Abstract

The discovery of new materials has a documented history of propelling human progress for centuries. The behaviour of a material is a function of its composition, structure, and properties, which further depend on its processing and testing conditions. Recent developments in deep learning and natural language processing have enabled information extraction at scale from published literature such as peer-reviewed publications, books, and patents. However, this information is spread across multiple formats, such as tables, text, and images, with little or no uniformity in reporting style, giving rise to several machine learning challenges. Here, we discuss, quantify, and document these challenges in automated information extraction (IE) from materials science literature towards the creation of a large materials science knowledge base. Specifically, we focus on IE from text and tables and outline several challenges with examples. We hope the present work inspires researchers to address the challenges in a coherent fashion, providing a fillip to IE towards developing a materials knowledge base.

Paper Structure (23 sections, 15 figures, 1 table)

Figures (15)

  • Figure 1: Quantifying challenges in information extraction from different elements of a research paper such as text, tables, and figures.
  • Figure 2: Occurrence of information regarding precursors (raw materials), compositions, properties, processing, and testing conditions in MatSci papers.
  • Figure 3: Classification of composition tables into single-cell composition (SCC) and multi-cell composition (MCC), each with complete information (CI) or partial information (PI).
  • Figure 4: Examples of tables (a) mentioning nominal (batch) and analyzed composition, and (b) having references to other papers.
  • Figure 5: (a) Table with composition mentioned as acronyms in ID (first column). (b) The value of variable 'M' needs to be inferred from the material IDs.
  • ...and 10 more figures