DiSCoMaT: Distantly Supervised Composition Extraction from Tables in Materials Science Articles

Tanishq Gupta; Mohd Zaki; Devanshi Khatsuriya; Kausik Hira; N. M. Anoop Krishnan; Mausam

TL;DR

This paper tackles the problem of extracting material compositions from tables in materials science articles to enrich domain knowledge bases. It introduces DiSCoMaT, a graph-based extraction framework that uses two GNNs to classify table types and to locate material IDs, constituents, and percentages. The GNNs are complemented by a rule-based composition parser and by selective use of text from captions and the paper body when the table alone is incomplete. The dataset combines distantly supervised labels from an existing materials science database with manually annotated dev/test tables, enabling evaluation across four table types: non-composition (NC), single-cell composition (SCC), multi-cell composition with complete information (MCC-CI), and multi-cell composition with partial information (MCC-PI). Empirical results show that DiSCoMaT significantly outperforms table-linearization baselines, with ablations confirming the value of task-specific features, caption integration, and constraint-aware training. The work provides a substantial resource (data and code) and opens avenues for end-to-end models and for extracting additional material properties from scientific tables.

Abstract

A crucial component in the curation of knowledge bases (KBs) for a scientific domain (e.g., materials science, foods & nutrition, fuels) is information extraction from tables in the domain's published research articles. To facilitate research in this direction, we define a novel NLP task of extracting compositions of materials (e.g., glasses) from tables in materials science papers. The task involves solving several challenges in concert: tables that mention compositions have highly varying structures; text in captions and the full paper needs to be incorporated along with data in tables; and regular languages for numbers, chemical compounds, and composition expressions must be integrated into the model. We release a training dataset comprising 4,408 distantly supervised tables, along with 1,475 manually annotated dev and test tables. We also present a strong baseline, DiSCoMaT, that combines multiple graph neural networks with several task-specific regular expressions, features, and constraints. We show that DiSCoMaT outperforms recent table processing architectures by significant margins.
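To make the "regular languages for numbers, chemical compounds, and composition expressions" concrete, here is a minimal Python sketch of how a single-cell composition string might be parsed. The patterns `NUMBER`, `COMPOUND`, and `TERM` and the function `parse_composition` are hypothetical names written for illustration; they are not the regexes from the paper's parser and cover only simple inputs in which each percentage directly precedes its compound.

```python
import re
from typing import Dict

# Illustrative patterns only (assumptions, not the paper's actual regexes).
NUMBER = r"\d+(?:\.\d+)?"            # integer or decimal, e.g. 40 or 33.3
COMPOUND = r"(?:[A-Z][a-z]?\d*)+"    # element symbols with optional subscripts, e.g. SiO2
TERM = re.compile(rf"({NUMBER})\s*({COMPOUND})")


def parse_composition(expr: str) -> Dict[str, float]:
    """Map each compound in a string such as '40SiO2-60Na2O' to its percentage."""
    return {compound: float(amount) for amount, compound in TERM.findall(expr)}


if __name__ == "__main__":
    print(parse_composition("40SiO2-60Na2O"))
    print(parse_composition("33.3Li2O 66.7B2O3"))
```

This sketch assumes that terms are separated by some non-alphanumeric character; without a separator, the trailing subscript of one compound and the leading percentage of the next would be ambiguous, which is one reason a purely regex-based approach needs additional context from the table and caption.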

Paper Structure (20 sections, 4 equations, 9 figures, 5 tables)

Introduction
Related work
Challenges in composition extraction from tables
Problem formulation
Dataset construction
DiSCoMaT architecture
GNN$_1$ and GNN$_2$ for table processing
SCC Predictor
MCC-CI and MCC-PI Extractors
Constraint-aware loss functions
Experiments
Results
Conclusions
Appendix
Constraint-aware training
...and 5 more sections

Figures (9)

Figure 1: Examples of composition tables: (a) multi-cell complete-info, (b) multi-cell partial-info with caption on top, (c) single-cell
Figure 2: Regexes in parser
Figure 3: The design of DiSCoMaT
Figure 4: Multi-cell composition tables: (a) complete information, (b) partial information
Figure 5: Confusion matrix for all table types
...and 4 more figures
