Reconstructing Materials Tetrahedron: Challenges in Materials Information Extraction
Kausik Hira, Mohd Zaki, Dhruvil Sheth, Mausam, N M Anoop Krishnan
TL;DR
This work systematically identifies and quantifies the information extraction (IE) challenges that arise when reconstructing the materials tetrahedron (composition, structure, properties, processing, testing) from materials science (MatSci) literature into a large materials knowledge base (KB). It combines a 2536-paper corpus with manual annotations and domain-specific IE models (DiSCoMaT for tables and GPT-4 for text) to quantify where information tends to appear (text vs. tables) and which table and text formats hinder extraction. Key findings show heavy reliance on tables for compositions and properties, frequent variability in table structure (multi-cell vs. single-cell composition, MCC/SCC; complete vs. partial information, CI/PI) and in reporting units, and critical challenges in linking information across multiple tables and the text. The study offers concrete guidelines for IE-friendly table design and argues that integrated IE pipelines, with joint table/text extraction and robust inter-table linking, are needed to turn the literature into a usable, scalable MatSci KB that can accelerate materials discovery.
Abstract
The discovery of new materials has propelled human progress for centuries. The behaviour of a material is a function of its composition, structure, and properties, which further depend on its processing and testing conditions. Recent developments in deep learning and natural language processing have enabled information extraction at scale from published literature such as peer-reviewed publications, books, and patents. However, this information is spread across multiple formats, such as tables, text, and images, with little or no uniformity in reporting style, which gives rise to several machine learning challenges. Here, we discuss, quantify, and document these challenges in automated information extraction (IE) from materials science literature towards the creation of a large materials science knowledge base. Specifically, we focus on IE from text and tables and outline several challenges with examples. We hope the present work inspires researchers to address the challenges in a coherent fashion, providing a fillip to IE towards developing a materials knowledge base.
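To make the table-format variability concrete, the following minimal sketch shows the same two compositions reported once with each constituent in its own cell and once with the whole composition packed into a single cell. It is an illustration only, not the DiSCoMaT model or any method from the paper; the table contents, the sample names, and the helpers `parse_mcc` and `parse_scc` are all hypothetical.

```python
import re

# Hypothetical multi-cell layout: one column per constituent.
mcc_table = {
    "header": ["Sample", "SiO2", "Na2O", "CaO"],
    "rows": [["G1", "70", "20", "10"], ["G2", "65", "25", "10"]],
}

# Hypothetical single-cell layout: the whole composition sits in one cell.
scc_table = {
    "header": ["Sample", "Composition (mol%)"],
    "rows": [["G1", "70SiO2-20Na2O-10CaO"], ["G2", "65SiO2-25Na2O-10CaO"]],
}

# Matches "<number><formula>" terms such as "70SiO2" or "12.5Na2O".
TERM = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Z][A-Za-z0-9]*)")


def parse_mcc(table):
    """Read compositions spread across several cells of each row."""
    constituents = table["header"][1:]
    return {
        row[0]: {c: float(v) for c, v in zip(constituents, row[1:])}
        for row in table["rows"]
    }


def parse_scc(table):
    """Read compositions written as a single string in one cell."""
    parsed = {}
    for sample, cell in table["rows"]:
        parsed[sample] = {
            formula: float(amount) for amount, formula in TERM.findall(cell)
        }
    return parsed


# Both layouts carry the same information but need different parsing logic.
print(parse_mcc(mcc_table) == parse_scc(scc_table))
```

Even in this toy case the two layouts need separate code paths, and the regular expression would not cope with other reporting styles (for example, variables such as `x` in the formula or units stated only in a caption), which hints at why a general extractor is hard to build.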
