Enhanced Multi-Tuple Extraction for Alloys: Integrating Pointer Networks and Augmented Attention

Mengzhe Hei, Zhouran Zhang, Qingbao Liu, Yan Pan, Xiang Zhao, Yongqian Peng, Yicong Ye, Xin Zhang, Shuxin Bai

TL;DR

This work tackles the scarcity of robust multi-tuple extraction from materials science literature by proposing a two-stage approach that first uses MatSciBERT with a pointer network for precise entity extraction and then applies an inter- and intra-entity attention-based allocation model to assemble complete tuples. The framework operates on a curated corpus of 255 sentences with 568 tuples across five entity types, achieving high $F1$ scores for 1–4 tuples ($0.963$, $0.947$, $0.848$, $0.753$) and $0.854$ on a random set, while outperforming four large language models in a prompt-based setting. Ablation studies confirm the critical roles of allocation and both attention mechanisms, and discussions compare the approach to traditional NER-RE pipelines, highlighting reduced hallucination risk and greater precision for structured scientific data. Overall, the method provides a scalable, domain-specific alternative to LLMs for extracting precise, structured material properties, enabling more reliable data for data-driven materials design.

Abstract

Extracting high-quality structured information from scientific literature is crucial for advancing material design through data-driven methods. Despite considerable research in natural language processing for dataset extraction, effective approaches for multi-tuple extraction in scientific literature remain scarce due to the complex interrelations of tuples and contextual ambiguities. In this study, we illustrate the multi-tuple extraction of mechanical properties from multi-principal-element alloys and present a novel framework that combines an entity extraction model based on MatSciBERT with pointer networks and an allocation model utilizing inter- and intra-entity attention. Our experiments on tuple extraction yield F1 scores of 0.963, 0.947, 0.848, and 0.753 on datasets with 1, 2, 3, and 4 tuples, respectively, confirming the effectiveness of the model. Furthermore, an F1 score of 0.854 was achieved on a randomly curated dataset. These results highlight the model's capacity to deliver precise and structured information, offering a robust alternative to large language models and equipping researchers with essential data for fostering data-driven innovations.

Paper Structure

This paper contains 12 sections, 12 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Common patterns of multiple entities and multiple relations in multi-principal-element alloys. We use three simple sentences to exemplify three common repetition patterns. However, in real-world scenarios and our dataset, the input text comprises multiple sentences. a An example of multiple properties of the same material. b An example of multiple property values of the same material and different condition values. c An example of multiple property values of the same property and different materials.
  • Figure 2: The proportion of sentences containing different numbers of tuples. The numbers below the pie chart indicate the number of tuples contained in each colored segment of the pie chart. On the right side of the pie chart are example sentences containing varying numbers of tuples. The proportions, in ascending order of tuple count, are 24.36%, 33.09%, 23.64%, 11.64%, and 7.27%.
  • Figure 3: The workflow and the framework of the proposed model for extracting and allocating entities. The workflow is presented in the upper section of the figure, beginning with the retrieval of full-text research articles from Elsevier, followed by the construction of a specialized corpus, from which we extract and annotate sentences to obtain a JSON-formatted dataset, and ending with model training and inference. The model is primarily composed of two components: entity extraction and entity allocation. a Entity Extraction: This component integrates MatSciBERT and a pointer network. MatSciBERT first tokenizes the input sentence and generates vector representations for each token. The pointer network then computes the probability of each token serving as the head or tail token of a specific entity, thereby identifying entities based on these probabilities. b Entity Allocation: This component assesses whether entities of different types belong to the same tuple through an entity matching score matrix. During model inference, we can enhance the matching likelihood of entities in corresponding order by multiplying the diagonal elements of the matrix by a parameter. c Entity Matching Score Matrix: Each element in the matrix represents a combination of six vectors. The first two vectors correspond to the vector representations of the two entities, while the remaining four vectors are derived from two attention mechanisms: intra-entity and inter-entity attention. Intra-entity attention focuses on the attention distribution among different types of entities, whereas inter-entity attention concentrates on the attention distribution within the same type of entity. d Inference: A four-tuple extraction example.
  • Figure 4: The F1 scores of the proposed and baseline models. Four baseline models are employed, all of them strong large language models. Among all the models, the one we proposed achieved the best performance.
  • Figure 5: The F1 scores of the proposed and variant models. Among all the models, the one we proposed achieved the best performance.
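
The two inference steps described in the Figure 3 caption can be illustrated with a small Python sketch. It is not the authors' implementation. The function names, the 0.5 threshold, the nearest-tail pairing rule, and the row-wise argmax matching are all assumptions made for illustration. Only the ideas of per-token head/tail probabilities and of multiplying the diagonal of the matching score matrix by a parameter come from the caption.

```python
from typing import List, Tuple


def decode_spans(head_probs: List[float],
                 tail_probs: List[float],
                 threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Turn per-token head/tail probabilities into (start, end) token spans.

    Assumption: a token is a head (or tail) if its probability reaches
    `threshold`, and each head is paired with the nearest tail at or after it.
    """
    spans = []
    for i, head_p in enumerate(head_probs):
        if head_p < threshold:
            continue
        for j in range(i, len(tail_probs)):
            if tail_probs[j] >= threshold:
                spans.append((i, j))
                break
    return spans


def boost_diagonal(scores: List[List[float]], factor: float) -> List[List[float]]:
    """Multiply the diagonal of a matching score matrix by `factor`.

    Entry (r, c) scores entity r of one type against entity c of another type,
    so the diagonal holds the pairs that appear in corresponding order.
    """
    return [[s * factor if r == c else s for c, s in enumerate(row)]
            for r, row in enumerate(scores)]


def match_entities(scores: List[List[float]], factor: float = 1.0) -> List[int]:
    """For each row entity, pick the column entity with the highest boosted score.

    Assumption: a simple row-wise argmax stands in for the allocation decision.
    """
    boosted = boost_diagonal(scores, factor)
    return [max(range(len(row)), key=row.__getitem__) for row in boosted]


if __name__ == "__main__":
    # Hypothetical probabilities for a six-token sentence and one entity type.
    head_probs = [0.1, 0.9, 0.2, 0.1, 0.8, 0.1]
    tail_probs = [0.1, 0.2, 0.7, 0.1, 0.9, 0.1]
    print(decode_spans(head_probs, tail_probs))

    # Hypothetical 2x2 matching scores between two entity types.
    scores = [[0.50, 0.55],
              [0.30, 0.60]]
    print(match_entities(scores, factor=1.0))
    print(match_entities(scores, factor=1.2))
```

With these made-up scores, the first entity is matched to the second candidate when no boost is applied, and to the first candidate once the diagonal is scaled by 1.2. This is the kind of order-based preference the caption describes.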