Data augmentation method for modeling health records with applications to clopidogrel treatment failure detection

Sunwoong Choi, Samuel Kim

TL;DR

The paper tackles data scarcity in EHR-based NLP with a permutation-based data augmentation that rearranges codes within a visit while preserving the order of medical record types, and evaluates it on clopidogrel treatment failure (TF) detection. A BERT-based model is pre-trained with masked language modeling (MLM) on unlabeled EHR sequences and fine-tuned for binary TF prediction. An augmentation factor $\alpha$ controls the number of augmented sequences per patient, and the size of the augmentation space is described by $p_i = |D_i|! \cdot |O_i|! \cdot |P_i|!$ for visit $i$ and $\prod_{i=1}^{n} p_i$ across a patient's $n$ visits. Augmentation during pre-training raises ROC-AUC by up to 5.3 percentage points (0.908 to 0.961), and augmentation during fine-tuning brings further benefits when labeled data are limited; test-time augmentation yields only marginal gains. Because the approach only reorders existing records, it creates no synthetic data and preserves data fidelity, which suggests a practical path toward data-efficient healthcare NLP models that could serve as foundation models in clinical applications.
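To make the counting formulas concrete, here is a minimal Python sketch of within-visit permutation, not the authors' implementation. It rests on several assumptions: each visit is stored as a dictionary keyed by the type labels `D`, `O`, and `P` (taken from the symbols in the formula; this summary does not say what each stands for), the types always appear in that fixed order, the codes are made up, and augmented sequences are drawn by independent random shuffles, so repeats are possible. The helper names `visit_permutations`, `augmentation_space`, and `augment` are hypothetical.

```python
import math
import random
from typing import Dict, List

# Hypothetical representation: a visit maps a type label to its codes.
Visit = Dict[str, List[str]]
TYPE_ORDER = ("D", "O", "P")  # assumed fixed order of record types in a visit


def visit_permutations(visit: Visit) -> int:
    """p_i = |D_i|! * |O_i|! * |P_i|! for a single visit."""
    return math.prod(math.factorial(len(visit.get(t, []))) for t in TYPE_ORDER)


def augmentation_space(visits: List[Visit]) -> int:
    """Product of p_i over all visits of one patient."""
    return math.prod(visit_permutations(v) for v in visits)


def augment_once(visits: List[Visit], rng: random.Random) -> List[str]:
    """Shuffle codes within each type of each visit; keep type and visit order."""
    sequence: List[str] = []
    for visit in visits:
        for record_type in TYPE_ORDER:
            codes = list(visit.get(record_type, []))  # copy, leave input intact
            rng.shuffle(codes)
            sequence.extend(codes)
    return sequence


def augment(visits: List[Visit], alpha: int, seed: int = 0) -> List[List[str]]:
    """Draw alpha augmented sequences (assumption: sampled with possible repeats)."""
    rng = random.Random(seed)
    return [augment_once(visits, rng) for _ in range(alpha)]


# Made-up patient with two visits.
patient = [
    {"D": ["d1", "d2"], "O": ["o1"], "P": ["p1", "p2", "p3"]},
    {"D": ["d3"], "P": ["p4", "p5"]},
]
space = augmentation_space(patient)  # (2! * 1! * 3!) * (1! * 0! * 2!) = 12 * 2 = 24
augmented = augment(patient, alpha=4)
```

For this toy patient the first visit admits 12 orderings and the second admits 2, so the augmentation space holds 24 sequences, of which `alpha=4` are drawn. Every drawn sequence contains exactly the original codes, which is why no synthetic records are introduced.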

Abstract

We present a novel data augmentation method to address data scarcity when modeling longitudinal patterns in patients' Electronic Health Records (EHR) with natural language processing (NLP) algorithms. The proposed method generates augmented data by rearranging the order of medical records within a visit, where the elements have no obvious order, if any. Applied to the clopidogrel treatment failure detection task, the method yielded up to a 5.3 percentage point absolute improvement in ROC-AUC (from 0.908 without augmentation to 0.961 with augmentation) when used during pre-training. The augmentation also improved performance during fine-tuning, especially when the amount of labeled training data was limited.

Paper Structure (14 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: Proposed data preparation method for augmenting data. $\bigotimes$ denotes the Cartesian product. Each box represents a medical record, and different colors denote different types of records. Boxes with bold lines indicate elements that differ from the first sequence.
  • Figure 2: Detection task: classify whether a treatment failure occurs within one year after the first prescription.
  • Figure 3: Pre-training procedure using data augmentation strategy.
  • Figure 4: Fine-tuning procedure using data augmentation strategy.
  • Figure 5: ROC curves and area under curves based on augmentation factor in pre-training.
  • ...and 1 more figure