Data Augmentation Method Utilizing Template Sentences for Variable Definition Extraction

Kotaro Nagayama, Shota Kato, Manabu Kano

TL;DR

The paper addresses two problems in variable-definition extraction: performance varies across fields, and field-specific labeled data is costly to obtain. It introduces a template-based data augmentation approach that generates new definition sentences by inserting existing variable-definition pairs into template sentences. The templates are produced with ChatGPT; pairs are drawn from a list $\mathcal{P}$, inserted into a set $\mathcal{T}_J$ of $J$ template sentences, and the generated sentences are stored in $\mathcal{D}_{\mathrm{TPL},J}$. DeBERTaV3$_{\mathrm{LARGE}}$ is then fine-tuned in a three-step pipeline. Experiments on chemical-process datasets (and related Symlink data) show that the augmented model achieves higher accuracy and more stable performance than a strong baseline, and that it generalizes to unseen processes (e.g., $Acc.$ around $88.3\%$ and improved F1). The method reduces the labeling burden while enabling robust cross-domain extraction. Further analyses show how the number of templates and the similarity between template content and target definitions influence results, which can guide template design and alignment with target-domain definitions.

Abstract

The extraction of variable definitions from scientific and technical papers is essential for understanding these documents. However, the characteristics of variable definitions, such as their length and the words that make them up, differ among fields, which leads to differences in the performance of existing extraction methods across fields. Although preparing training data specific to each field can improve performance, creating high-quality training data is costly. To address this challenge, this study proposes a new method that generates new definition sentences from template sentences and variable-definition pairs in the training data. The proposed method has been tested on papers about chemical processes, and the results show that the model trained with the definition sentences generated by the proposed method achieved an accuracy of 89.6%, surpassing existing models.
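The generation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the placeholder syntax (`[VAR1]`, `[DEF1]`), the function names, the random sampling strategy, and the example template and pairs are all assumptions made for illustration.

```python
import random
import re

# Hypothetical placeholder syntax: [VARk] and [DEFk] mark the k-th
# variable and its definition inside a template sentence.
SLOT_PATTERN = re.compile(r"\[(VAR|DEF)(\d+)\]")


def count_slots(template):
    """Return how many variable-definition pairs the template needs."""
    indices = [int(index) for _, index in SLOT_PATTERN.findall(template)]
    return max(indices, default=0)


def fill_template(template, pairs):
    """Replace every [VARk]/[DEFk] placeholder with the k-th pair."""
    def substitute(match):
        kind, index = match.group(1), int(match.group(2)) - 1
        variable, definition = pairs[index]
        return variable if kind == "VAR" else definition

    return SLOT_PATTERN.sub(substitute, template)


def augment(templates, pairs, n_sentences, seed=0):
    """Generate sentences by pairing random templates with random pairs.

    Returns a list of (sentence, pairs_used) tuples, so each generated
    sentence keeps the labels needed for training.
    """
    rng = random.Random(seed)
    generated = []
    for _ in range(n_sentences):
        template = rng.choice(templates)
        chosen = rng.sample(pairs, count_slots(template))
        generated.append((fill_template(template, chosen), chosen))
    return generated


# Illustrative inputs (assumed, not taken from the paper's data).
templates = ["where [VAR1] is [DEF1] and [VAR2] denotes [DEF2]."]
pairs = [("T", "the reactor temperature"), ("F", "the feed flow rate")]
sentence = fill_template(templates[0], pairs)
dataset = augment(templates, pairs, n_sentences=5)
```

Because each generated sentence is returned together with the pairs used to build it, the definition spans are known by construction and no manual labeling is needed for the augmented data.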

Paper Structure

This paper contains 15 sections, 6 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Overview of the proposed method. $\mathcal{T}_J$ represents a set of $J$ types of template sentences, $\mathcal{P}$ is a list storing variable-definition pairs included in the training data, and $\mathcal{D}_{\mathrm{TPL},J}$ is a list for storing the generated sentences.
  • Figure 2: Examples of template sentences, each containing pairs of variable and definition tokens or solely symbol tokens. The number of variable tokens in each template sentence ranges from one to six.
  • Figure 3: Schematic diagram of the variable definition extraction method by Yamamoto et al. The vectors $\bm{s}_{\mathrm{start}}$ and $\bm{s}_{\mathrm{end}}$ represent the probabilities of each token being the start and end positions of a definition, respectively. The tokens predicted as the start and end positions of a definition, corresponding to $s_{\mathrm{start},i}$ and $s_{\mathrm{end},j}$, are highlighted in yellow.
  • Figure 4: Distribution of each evaluation metric across ten experiments. $J=J'$ in the figure corresponds to the distribution of the Proposed$_{J=J'}$. In each box, the orange horizontal line represents the median and the mark $\times$ represents the mean value.
  • Figure 5: Distribution of $Acc.$ across ten experiments using two methods: the method by Yamamoto et al. with $\mathcal{D}_\mathrm{Process}$ as training data (left side of each figure) and Proposed$_{J=300}$ with $\mathcal{D}_{\mathrm{Process}-X}$ as training data (right side of each figure). The orange line indicates the median and the mark $\times$ represents the mean.
  • ...and 1 more figure
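The caption of Figure 3 describes two per-token probability vectors, $\bm{s}_{\mathrm{start}}$ and $\bm{s}_{\mathrm{end}}$, from which a definition span is predicted. The sketch below shows one common way to decode such vectors into a span: choose the pair $(i, j)$ with $i \le j$ that maximizes $s_{\mathrm{start},i} \cdot s_{\mathrm{end},j}$. The source does not state the decoding rule used, so this rule, the `max_len` limit, the function name, and the example scores are all assumptions.

```python
def best_span(s_start, s_end, max_len=30):
    """Pick the (start, end) token indices with the highest joint score.

    Assumed decoding rule: maximize s_start[i] * s_end[j] subject to
    i <= j and a span length of at most max_len tokens.
    """
    best, best_score = (0, 0), -1.0
    for i, p_start in enumerate(s_start):
        for j in range(i, min(i + max_len, len(s_end))):
            score = p_start * s_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score


# Illustrative tokens and made-up probabilities.
tokens = ["where", "T", "is", "the", "reactor", "temperature", "."]
s_start = [0.01, 0.02, 0.02, 0.85, 0.05, 0.03, 0.02]
s_end = [0.01, 0.01, 0.01, 0.02, 0.10, 0.83, 0.02]

(start, end), score = best_span(s_start, s_end)
definition = " ".join(tokens[start:end + 1])
```

Restricting the search to $i \le j$ rules out spans that end before they start, which taking the two argmax positions independently would not guarantee.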