DiSCoMaT: Distantly Supervised Composition Extraction from Tables in Materials Science Articles
Tanishq Gupta, Mohd Zaki, Devanshi Khatsuriya, Kausik Hira, N. M. Anoop Krishnan, Mausam
TL;DR
This paper tackles the problem of extracting material compositions from tables in materials science articles to enrich domain knowledge bases. It introduces DiSCoMaT, a graph-based extraction framework that uses two GNNs to classify table types and locate material IDs, constituents, and percentages. These are complemented by a rule-based composition parser and by selective use of text from captions and the paper body when the table alone gives incomplete information. The dataset combines distantly supervised labels from a materials science database with manually annotated dev and test tables, enabling evaluation across four table types: non-composition (NC), single-cell composition (SCC), multi-cell composition with complete information (MCC-CI), and multi-cell composition with partial information (MCC-PI). Empirical results show that DiSCoMaT significantly outperforms table-linearization baselines, with ablations confirming the value of task-specific features, caption integration, and constraint-aware training. The work releases data and code, and opens avenues for end-to-end models and for extracting additional material properties from scientific tables.
Abstract
A crucial component in the curation of a knowledge base (KB) for a scientific domain (e.g., materials science, foods & nutrition, fuels) is information extraction from tables in the domain's published research articles. To facilitate research in this direction, we define a novel NLP task of extracting compositions of materials (e.g., glasses) from tables in materials science papers. The task involves solving several challenges in concert: tables that mention compositions have highly varying structures; text in captions and the full paper needs to be incorporated along with data in tables; and regular languages for numbers, chemical compounds, and composition expressions must be integrated into the model. We release a training dataset comprising 4,408 distantly supervised tables, along with 1,475 manually annotated dev and test tables. We also present a strong baseline, DiSCoMaT, that combines multiple graph neural networks with several task-specific regular expressions, features, and constraints. We show that DiSCoMaT outperforms recent table processing architectures by significant margins.
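To make the "regular languages for numbers, chemical compounds and composition expressions" concrete, here is a minimal Python sketch of how such patterns might be composed to parse a composition written in a single cell, such as `20Na2O-80SiO2`. This is an illustration only, not the DiSCoMaT implementation: the pattern names, the `parse_composition` helper, and the assumption that terms are separated by hyphens are all hypothetical.

```python
import re

# Hypothetical patterns for illustration; not taken from the DiSCoMaT code.
NUMBER = r"\d+(?:\.\d+)?"        # integer or decimal, e.g. 20 or 0.5
ELEMENT = r"[A-Z][a-z]?"         # element symbol, e.g. Na, O, Si
COMPOUND = rf"(?:{ELEMENT}\d*)+" # elements with optional counts, e.g. Na2O
TERM = re.compile(rf"\s*({NUMBER})\s*({COMPOUND})\s*")


def parse_composition(expr):
    """Parse a hyphen-separated composition into (compound, amount) pairs.

    Returns None if any term does not match the assumed pattern.
    """
    pairs = []
    for part in expr.split("-"):
        match = TERM.fullmatch(part)
        if match is None:
            return None
        pairs.append((match.group(2), float(match.group(1))))
    return pairs


print(parse_composition("20Na2O-80SiO2"))
```

Real tables are far less regular than this: amounts may follow the compound, units (mol%, wt%) may appear only in the caption, and variables such as `x` may stand in for numbers. Simple patterns like these therefore cover only one part of the task described above.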
