Data Augmentation Method Utilizing Template Sentences for Variable Definition Extraction
Kotaro Nagayama, Shota Kato, Manabu Kano
TL;DR
The paper addresses two problems in variable-definition extraction: performance varies across fields, and field-specific labeled data is costly to obtain. It introduces a template-based data augmentation approach that generates new definition sentences by combining existing variable-definition pairs with template sentences. The templates are produced with ChatGPT and managed by a P–T_J–D_TPL,J framework, and the augmented data feeds a three-step fine-tuning pipeline on DeBERTaV3_LARGE. Experiments on chemical-process datasets (and related Symlink data) show that the augmented model achieves higher accuracy and more stable performance than a strong baseline, with notable generalization to unseen processes (e.g., $Acc$ around $88.3\%$ and improved F1). The method reduces the labeling burden while enabling robust cross-domain extraction. Analyses of how template count and template-content similarity influence results point to future improvements in template design and alignment with target-domain definitions.
Abstract
The extraction of variable definitions from scientific and technical papers is essential for understanding these documents. However, the characteristics of variable definitions, such as their length and the words they contain, differ among fields, so the performance of existing extraction methods also differs across fields. Although preparing training data specific to each field can improve performance, creating high-quality training data is costly. To address this challenge, this study proposes a new method that generates new definition sentences from template sentences and variable-definition pairs in the training data. The proposed method was tested on papers about chemical processes, and the results show that the model trained with the definition sentences generated by the proposed method achieved an accuracy of 89.6%, surpassing existing models.
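
The core idea of generating definition sentences from templates and variable-definition pairs can be illustrated with a minimal sketch. This is not the authors' implementation. The slot tokens `[VAR]` and `[DEF]`, the function name `augment`, the example templates, and the example pairs are all hypothetical, chosen only to show how each pair might be combined with each template to yield labeled training sentences.

```python
from itertools import product

# Hypothetical slot tokens; the paper's actual template format may differ.
VAR_SLOT = "[VAR]"
DEF_SLOT = "[DEF]"


def augment(templates, pairs):
    """Combine every template with every (variable, definition) pair.

    Returns a list of dicts holding the generated sentence together with
    the variable and definition it contains, so the labels come for free.
    """
    sentences = []
    for template, (variable, definition) in product(templates, pairs):
        # Skip templates that lack one of the two slots.
        if VAR_SLOT not in template or DEF_SLOT not in template:
            continue
        text = template.replace(VAR_SLOT, variable).replace(DEF_SLOT, definition)
        sentences.append(
            {"text": text, "variable": variable, "definition": definition}
        )
    return sentences


# Illustrative inputs (assumed, not taken from the paper's datasets).
templates = [
    "[VAR] denotes the [DEF].",
    "Here, the [DEF] is represented by [VAR].",
]
pairs = [("T", "reactor temperature"), ("F", "feed flow rate")]

augmented = augment(templates, pairs)
for item in augmented:
    print(item["text"])
```

With two templates and two pairs, the sketch yields four sentences. The number of generated sentences grows as the product of the two counts, which is one reason the number of templates is a natural quantity to analyze.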
