Simple Data Augmentation Techniques for Chinese Disease Normalization

Wenqian Cui, Xiangling Fu, Shaohui Liu, Mingjun Gu, Xien Liu, Ji Wu, Irwin King

TL;DR

This work tackles data scarcity in Chinese disease name normalization by introducing Disease Data Augmentation (DDA), which exploits the Structural Invariance of axis words and the Hierarchy of ICD-10 to generate diverse, semantically consistent training pairs. The pipeline combines a NER module (BiLSTM-CRF) that identifies axis words, a data augmentation module with Axis-word Replacement (AR) and Multi-Granularity Aggregation (MGA), and a semantic filtering step based on normalized n-gram and BERT-based similarities, followed by a pre-training and fine-tuning regime. On CHIP-CDN, DDA consistently outperforms standard data augmentation methods (EDA, back-translation) across several baseline models, with pronounced gains in data-scarce settings and notable zero-shot efficacy (e.g., BiHardNCE recovering ~80% of full performance). Compared with large language models, the DDA-enhanced CDN-Baseline achieves comparable or superior effectiveness with a model that is orders of magnitude smaller, highlighting its practical value for scalable, domain-specific normalization in healthcare. The findings suggest that data augmentation built on domain-specific structural and hierarchical properties can substantially improve medical NLP tasks where labeled data are scarce.
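
To make the two augmentation ideas in this summary concrete, the following is a minimal sketch of one plausible reading of them, not the authors' implementation. It assumes that Axis-word Replacement swaps the same axis word on both sides of a (mention, standard name) pair, and that Multi-Granularity Aggregation pairs a fine-grained name with the name of its coarser parent code, found here by truncating the code string. The function names `axis_word_replace` and `aggregate_to_parent`, the English strings, and the `X00` codes are all hypothetical placeholders; the NER and semantic filtering steps are omitted.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (unnormalized mention, standard name)


def axis_word_replace(pair: Pair, old: str, new: str) -> List[Pair]:
    """Swap axis word `old` for `new` on both sides of a training pair.

    Assumption: the pair stays valid only if the axis word changes in both
    names, so nothing is returned when `old` is missing from either side.
    """
    mention, standard = pair
    if old not in mention or old not in standard:
        return []
    return [(mention.replace(old, new), standard.replace(old, new))]


def aggregate_to_parent(code_to_name: Dict[str, str], prefix_len: int) -> List[Pair]:
    """Pair each fine-grained name with the name of its coarser parent code.

    Assumption: the parent code is the first `prefix_len` characters of the
    child code, and only parents present in `code_to_name` are used.
    """
    pairs: List[Pair] = []
    for code, name in code_to_name.items():
        parent = code[:prefix_len]
        if parent != code and parent in code_to_name:
            pairs.append((name, code_to_name[parent]))
    return pairs


# Toy inputs, invented purely for illustration.
seed_pair = ("fracture of left femur", "left femur fracture")
toy_codes = {
    "X00.0": "toy parent disease",
    "X00.001": "toy child disease A",
    "X00.002": "toy child disease B",
}

ar_pairs = axis_word_replace(seed_pair, "left femur", "right tibia")
mga_pairs = aggregate_to_parent(toy_codes, prefix_len=5)
print(ar_pairs)
print(mga_pairs)
```

In a fuller pipeline the generated pairs would still pass through a similarity filter before being used for pre-training, as the summary describes; that step is left out here because its exact form is not specified in this text.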

Abstract

Disease name normalization is an important task in the medical domain. It classifies disease names written in various formats into standardized names, serving as a fundamental component in smart healthcare systems for various disease-related functions. Nevertheless, the most significant obstacle to existing disease name normalization systems is the severe shortage of training data. Consequently, we present a novel data augmentation approach that includes a series of data augmentation techniques and some supporting modules to help mitigate the problem. Our proposed methods rely on the Structural Invariance property of disease names and the Hierarchy property of the disease classification system. The goal is to equip the models with extensive understanding of the disease names and the hierarchical structure of the disease name classification system. Through extensive experimentation, we illustrate that our proposed approach exhibits significant performance improvements across various baseline models and training objectives, particularly in scenarios with limited training data.

Paper Structure (28 sections, 4 equations, 8 figures, 8 tables, 4 algorithms)

This paper contains 28 sections, 4 equations, 8 figures, 8 tables, 4 algorithms.

Figures (8)

  • Figure 1: Examples and illustrations of the disease name normalization task.
  • Figure 2: Data scarcity problem in commonly-used disease name normalization and related datasets. Left: The number of disease names presented in the CHIP-CDN training set versus the total number of disease names classified by the first letter. Right: The percentage of disease concepts mentioned in various datasets.
  • Figure 3: A taxonomy of biomedical entity linking methods. Our approach falls into the data augmentation category within the disease name normalization task.
  • Figure 4: The overall pipeline of our proposed Disease Data Augmentation (DDA) approach. AR1, AR2, MGA-Code, and MGA-Region are the four proposed data augmentation techniques, and their details are illustrated in Figure 5.
  • Figure 5: Illustration of our proposed data augmentation techniques. The upper portion of the figure depicts the Axis-word Replacement methods, and the lower portion depicts the Multi-Granularity Aggregation methods.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Remark 1
  • Remark 2