PGA-SciRE: Harnessing LLM on Data Augmentation for Enhancing Scientific Relation Extraction

Yang Zhou, Shimin Shan, Hongkui Wei, Zhehuan Zhao, Wenshuo Feng

TL;DR

PGA introduces two LLM-based data augmentation strategies—paraphrasing and generating—to enhance scientific domain relation extraction by producing labeled pseudo-samples that augment the original data. The framework demonstrates consistent F1 gains across multiple backbone RE models on the SciERC dataset, with paraphrasing generally offering higher fidelity and better performance than generation. While combining both pseudo-sample types can sometimes help, it often introduces noise that degrades results, and using pseudo-samples alone is ineffective. Overall, PGA provides a practical approach to reduce labeling costs and boost RE performance in specialized domains through careful prompt design and data post-processing.

Abstract

Relation Extraction (RE) aims to recognize the relation between pairs of entities mentioned in a text. Advances in large language models (LLMs) have had a tremendous impact on NLP. In this work, we propose PGA, a textual data augmentation framework for improving the performance of RE models in the scientific domain. The framework introduces two ways of augmenting data: using an LLM to paraphrase the original training samples, which yields pseudo-samples with the same meaning but different wording and form; and instructing the LLM to generate sentences that implicitly carry the corresponding labels, based on the relation and entities of the original training samples. Each kind of pseudo-sample is combined separately with the original dataset to train the RE model. In our experiments, the PGA framework improves the F1 scores of three mainstream RE models in the scientific domain. Using an LLM to obtain samples can also effectively reduce the cost of manually labeling data.
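
The two augmentation routes described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `RESample` type, the prompt wording, the `llm` callable (a stand-in for whatever model client is used), and the entity-retention filter are hypothetical and are not taken from the paper's actual prompts or post-processing rules.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RESample:
    """One labeled RE example (hypothetical structure, not the paper's format)."""
    sentence: str
    head: str      # first entity mention
    tail: str      # second entity mention
    relation: str  # relation label, e.g. "USED-FOR"


def paraphrase_prompt(s: RESample) -> str:
    # Assumed wording: ask for a meaning-preserving rewrite that keeps both entities.
    return (
        "Paraphrase the sentence below without changing its meaning. "
        f'Keep the entity mentions "{s.head}" and "{s.tail}" verbatim.\n'
        f"Sentence: {s.sentence}"
    )


def generation_prompt(s: RESample) -> str:
    # Assumed wording: ask for a new sentence built only from the entities and label.
    return (
        f'Write one scientific sentence in which "{s.head}" and "{s.tail}" '
        f'stand in the relation "{s.relation}".'
    )


def keeps_entities(text: str, s: RESample) -> bool:
    # Assumed filter: a pseudo-sample is usable only if both mentions survive.
    low = text.lower()
    return s.head.lower() in low and s.tail.lower() in low


def augment(
    samples: List[RESample],
    llm: Callable[[str], str],
    build_prompt: Callable[[RESample], str],
) -> List[RESample]:
    """Query the LLM once per sample and keep outputs that pass the filter."""
    pseudo = []
    for s in samples:
        text = llm(build_prompt(s)).strip()
        # Drop empty outputs, verbatim copies, and outputs that lost an entity.
        if text and text != s.sentence and keeps_entities(text, s):
            pseudo.append(RESample(text, s.head, s.tail, s.relation))
    return pseudo
```

Calling `augment(train, llm, paraphrase_prompt)` and `augment(train, llm, generation_prompt)` would give the two pseudo-sample sets, each of which is then concatenated with the original training set for a separate fine-tuning run. Each pseudo-sample inherits the label of the sample it came from, which is why no manual annotation is needed for the new data.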

Paper Structure

This paper contains 23 sections, 4 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: General framework of PGA. The left part of the figure shows the prompt composition for the two data augmentation approaches. The two prompts, each containing original training samples, are fed iteratively into GPT-3.5 to synthesize two kinds of pseudo-samples. These are post-processed, filtered, and converted into the format required by each ERE model, then used for fine-tuning along with the original training samples.
  • Figure 2: Performance of the SpERT model (Eberts and Ulges, 2019) when trained on the original training set plus different numbers of pseudo-samples. The horizontal axis shows the number of pseudo-samples used and the vertical axis shows the F1 score.
  • Figure 3: Distribution of the embeddings of the pseudo-samples and the sentences of the original dataset in the vector space.