WHERE and WHICH: Iterative Debate for Biomedical Synthetic Data Augmentation

Zhengyi Zhao, Shubo Zhang, Bin Liang, Binyang Li, Kam-Fai Wong

TL;DR

Data scarcity in BioNLP hampers reliable Biomedical Language Model reasoning. BioRDA introduces a rationale-based, WHERE-AND-WHICH augmentation framework with a multi-agent debate and an Attribution Selector to preserve biomedical coherence in synthetic data. Evaluations across 9 BLURB/BigBIO datasets show an average improvement of $+2.98\%$ and demonstrate the critical role of WHERE information for context and WHICH information for biomedical terminology. The approach offers a scalable, domain-aware augmentation strategy that mitigates counterfactual data and enhances BioNLP model performance.

Abstract

In Biomedical Natural Language Processing (BioNLP) tasks, such as Relation Extraction, Named Entity Recognition, and Text Classification, the scarcity of high-quality data remains a significant challenge. This limitation impairs the ability of large language models to correctly understand relationships between biological entities, such as molecules and diseases, or drug interactions, and can in turn lead to misinterpretation of biomedical documents. To address this issue, current approaches generally adopt synthetic data augmentation, which involves similarity computation followed by word replacement, but they often generate counterfactual data. As a result, these methods disrupt meaningful word sets or produce sentences whose meanings deviate substantially from the original context, rendering them ineffective in improving model performance. To this end, this paper proposes a rationale-based synthetic data augmentation method dedicated to the biomedical domain. Beyond naive lexicon similarity, a specific bio-relation similarity is measured so that each augmented instance remains strongly correlated with the bio-relation, rather than simply increasing the diversity of the augmented data. Moreover, a multi-agent reflection mechanism helps the model iteratively distinguish different usages of similar entities and avoid the mis-replacement trap. We evaluate our method on the BLURB and BigBIO benchmarks, which together include 9 common datasets spanning four major BioNLP tasks. Our experimental results demonstrate consistent performance improvements across all tasks, highlighting the effectiveness of our approach in addressing the challenges associated with data scarcity and enhancing the overall performance of biomedical NLP models.

Paper Structure

This paper contains 29 sections, 4 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Examples of falsely augmented data. Slightly rephrasing the original sentences with logical errors, or mis-replacing similar words, leads the model to a completely wrong understanding. Further, substituting "concentration" for "dose" changes the whole sentence's in-context meaning and spreads the model's misunderstanding from one sentence to the whole passage.
  • Figure 2: Overview of the BioRDA framework. In step 1, both lexicon-level and biomedical-level similarity are evaluated to find the most appropriate position to rephrase. Step 2 adopts a multi-agent system to select the best rephrase candidate.
  • Figure 3: Demo of data reconstruction with lexicon diversity and under relation restriction. The top panel shows the positions that could be replaced according to lexicon similarity, while the bottom panel shows the token most relevant to biomedical relations.
  • Figure 4: Training process for pseudo-instance generation. A T5 model is applied to learn the WHERE position information while extracting syntactic features to help the model understand the sentence pattern.