Data Augmentation Techniques for Process Extraction from Scientific Publications

Yuni Susanti

TL;DR

This work tackles data scarcity in process extraction from scientific literature by proposing targeted data augmentation methods that preserve process meaning. It introduces LSIM, PSIM/PSIM-A, and SSIM-based strategies that use process predicates and sentence patterns to generate meaningful augmented sentences for sequence-labeling models. Evaluated on a chemistry/materials-synthesis dataset with a Bi-LSTM CRF, the methods improve F1 by up to 12.3 points, with the largest benefit on small training sets, and may also reduce overfitting. The approach shows promise for generalizing to other sequence-labeling tasks and to domains beyond chemistry.

Abstract

We present data augmentation techniques for process extraction from scientific publications. We cast process extraction as a sequence labeling task in which we identify all the entities in a sentence and label them according to their process-specific roles. The proposed methods attempt to create meaningful augmented sentences by utilizing (1) process-specific information from the original sentence, (2) role label similarity, and (3) sentence similarity. We demonstrate that the proposed methods substantially improve the performance of a process extraction model trained on chemistry domain datasets, raising the F-score by up to 12.3 points. The proposed methods could also reduce overfitting, especially when training on small datasets or in low-resource settings such as chemistry and other scientific domains.
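To make the sequence-labeling framing concrete, the following is a minimal sketch of one simple way role labels can guide augmentation: each labeled entity mention is replaced with another mention that carries the same role label elsewhere in the training data. This is generic mention replacement, not the LSIM, PSIM, or SSIM methods themselves, whose details are not given here. The BIO tagging scheme, the role names `Operation` and `Material`, the toy sentences, and all function names are assumptions made for illustration.

```python
import random
from collections import defaultdict


def extract_spans(tokens, tags):
    """Return (start, end, role) for each BIO-tagged entity span."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans


def build_pool(corpus):
    """Group every entity mention in the corpus by its role label."""
    pool = defaultdict(list)
    for tokens, tags in corpus:
        for s, e, role in extract_spans(tokens, tags):
            pool[role].append(tokens[s:e])
    return pool


def augment(tokens, tags, pool, rng):
    """Replace each entity mention with a random mention of the same role.

    Assumes the sentence's roles all occur in the pool, which holds
    whenever the sentence itself is part of the corpus behind the pool.
    """
    new_tokens, new_tags, cursor = [], [], 0
    for s, e, role in extract_spans(tokens, tags):
        # Copy the unlabeled context between entities unchanged.
        new_tokens += tokens[cursor:s]
        new_tags += tags[cursor:s]
        mention = rng.choice(pool[role])
        new_tokens += mention
        new_tags += ["B-" + role] + ["I-" + role] * (len(mention) - 1)
        cursor = e
    new_tokens += tokens[cursor:]
    new_tags += tags[cursor:]
    return new_tokens, new_tags


# Toy corpus with hypothetical role labels (illustration only).
corpus = [
    (["Dissolve", "copper", "sulfate", "in", "water"],
     ["B-Operation", "B-Material", "I-Material", "O", "B-Material"]),
    (["Heat", "ethanol"],
     ["B-Operation", "B-Material"]),
]
pool = build_pool(corpus)
new_tokens, new_tags = augment(*corpus[0], pool, random.Random(0))
print(list(zip(new_tokens, new_tags)))
```

Replacement of this kind keeps the order of role labels in the sentence but can produce chemically implausible combinations, which is the sort of meaning drift that augmentation guided by process-specific information and sentence similarity is intended to limit.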

Paper Structure (11 sections, 2 equations, 1 table)