Data Augmentation Techniques for Chinese Disease Name Normalization

Wenqian Cui, Xiangling Fu, Shaohui Liu, Mingjun Gu, Xien Liu, Ji Wu, Irwin King

TL;DR

This paper tackles data scarcity in Chinese disease name normalization by introducing Disease Data Augmentation (DDA), a two-part augmentation framework that combines Axis-word Replacement (AR) and Multi-Granularity Aggregation (MGA) to enrich training signals. An NER module identifies axis words, and a semantic-filtering stage prunes augmented pairs using thresholds on $n$-gram and contextual similarity to produce a high-quality dataset. Pre-training on the augmented data, followed by fine-tuning on the original data, yields significant improvements over baselines, especially in limited-data settings; zero-shot Bi-hardNCE reaches roughly $80\%$ of full performance on key metrics. The approach leverages the ICD hierarchical structure and anatomical taxonomies to create diverse, linguistically coherent disease-name pairs, offering practical benefits for clinical NLP and healthcare automation under privacy constraints.
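
To make the replace-then-filter idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names, the character-bigram Jaccard score, the `min_sim` threshold, and the choice to keep pairs at or above the threshold are all assumptions made for illustration. The axis word is assumed to have been found already by an NER step, and the contextual-similarity filter is omitted.

```python
from typing import Iterable, List, Set, Tuple


def char_ngrams(text: str, n: int = 2) -> Set[str]:
    """Character n-grams of `text`; a string shorter than n counts as one gram."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap of character n-grams (an assumed stand-in score)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0


def augment_pair(
    mention: str,
    standard: str,
    axis_word: str,
    alternatives: Iterable[str],
    min_sim: float = 0.5,  # hypothetical threshold, not taken from the paper
) -> List[Tuple[str, str]]:
    """Swap `axis_word` in both names, then keep pairs that pass the filter."""
    if axis_word not in mention or axis_word not in standard:
        return []
    kept = []
    for alt in alternatives:
        if alt == axis_word:
            continue  # replacing a word with itself adds nothing new
        new_pair = (mention.replace(axis_word, alt),
                    standard.replace(axis_word, alt))
        if ngram_similarity(*new_pair) >= min_sim:
            kept.append(new_pair)
    return kept


if __name__ == "__main__":
    # Hypothetical example: swap the anatomical axis word in both names.
    print(augment_pair("左肾结石", "肾结石", "肾", ["输尿管", "肾"]))
```

A surface-overlap filter like this cannot judge clinical plausibility, which is presumably why the described pipeline also thresholds on contextual similarity.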

Abstract

Disease name normalization is an important task in the medical domain. It classifies disease names written in various formats into standardized names, serving as a fundamental component in smart healthcare systems for various disease-related functions. Nevertheless, the most significant obstacle to existing disease name normalization systems is the severe shortage of training data. Consequently, we present a novel data augmentation approach that includes a series of data augmentation techniques and some supporting modules to help mitigate the problem. Through extensive experimentation, we illustrate that our proposed approach exhibits significant performance improvements across various baseline models and training objectives, particularly in scenarios with limited training data.

Paper Structure (10 sections, 3 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: Illustration of our proposed data augmentation techniques. The upper portion of the figure depicts the Axis-word Replacement methods, and the lower portion depicts the Multi-Granularity Aggregation methods.
  • Figure 2: Performance comparison on smaller datasets for BiLSTM and BERT-base. The smaller datasets are derived by randomly sampling a portion of the CHIP-CDN training set. The validation set of CHIP-CDN stays the same.
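
The limited-data setup in the Figure 2 caption (randomly sampling a portion of the training set while leaving the validation set untouched) can be sketched as follows. The function name `subsample`, the fixed seed, and the rounding rule are assumptions for illustration; the source does not state the sampling fractions or how the draw was seeded.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(train: Sequence[T], fraction: float, seed: int = 0) -> List[T]:
    """Draw a random portion of the training set without replacement."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one example, but never more than the set contains.
    k = min(len(train), max(1, round(len(train) * fraction)))
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    return rng.sample(list(train), k)


if __name__ == "__main__":
    train_pairs = [(f"mention_{i}", f"standard_{i}") for i in range(100)]
    small_train = subsample(train_pairs, fraction=0.1)
    print(len(small_train))  # the validation set is left as-is
```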

Theorems & Definitions (3)

  • Definition 1: Disease Name Normalization
  • Definition 2: Axis Word
  • Remark 1