Preserving Empirical Probabilities in BERT for Small-sample Clinical Entity Recognition

Abdul Rehman, Jian Jun Zhang, Xiaosong Yang

TL;DR

This work tackles imbalanced entity labels in BERT-based clinical NER. It introduces a black-box bias testing framework and shows that conventional training can amplify empirical bias, even under random labeling. To mitigate this, it proposes Binary Token Labels (BTL) with weighted loss and per-class batching. On the MACCROBAT dataset, BTL with KNN calibration improves unweighted accuracy and per-entity F1 while preserving overall performance. The approach offers a practical route to more equitable recognition of minority clinical entities and highlights trade-offs across entity types. The findings bear on deploying clinical NER systems in settings with skewed label distributions, where rare but clinically important terms must be extracted robustly.

Abstract

Named Entity Recognition (NER) faces the challenge of unbalanced labels: in real-world datasets, certain entity types are overrepresented while others are underrepresented. This imbalance can lead to biased models that perform poorly on minority entity classes, impeding accurate and equitable entity recognition. This paper explores the effects of unbalanced entity labels on a BERT-based pre-trained model. We analyze different mechanisms of loss calculation and loss propagation for token classification on randomized datasets, and then propose ways to improve token classification for the highly imbalanced task of clinical entity recognition.

Paper Structure

This paper contains 8 sections, 3 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Framework for fine-tuning BERT for token classification using token labels for all the words. This conventional approach of learning multiple token labels at the same time can undermine the importance of sparsely occurring token labels.
  • Figure 2: This histogram shows distributions of the classification layer's output logits for the clinical entity recognition task. Different colours represent entity label classes (only 13 out of 41 are shown here). $A$ is the percentage of tokens that are predicted to belong to each particular class when the logit value for that class is the highest. $N$ is the true percentage of token labels in the test set. $\sigma$ is the standard deviation and $\bar{x}$ is the mean of the corresponding logits' distributions. Small sample classes are marked by '*' as their prediction count does not match the empirically expected count. $A=0\%$ implies that none of the test samples were classified as belonging to the two smaller classes.
  • Figure 3: To test BERT's propensity to learn from empirical evidence, the meaningful clinical entity labels are replaced with randomly assigned token labels that carry no meaningful pattern.
  • Figure 4: The distribution of predicted 3-class logits for the test set passages after training the BERT-base-cased model for 1, 13 and 30 epochs using the weighted cross-entropy loss for all token labels (ATL) in all batches.
  • Figure 5: An example of splitting a batch to create multiple batches with only one class of true positive (either 'M' or 'N') per batch in contrast with true negative 'O'.
  • ...and 3 more figures
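
The batch splitting described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes that, in each derived batch, tokens belonging to any other positive class are relabeled as the negative class 'O'; the captions do not say whether the paper relabels such tokens or excludes them from the loss, so this is an assumption. The function name `split_batch_per_class` and the toy label sequences are hypothetical and chosen only for illustration.

```python
def split_batch_per_class(batch_labels, negative="O"):
    """Derive one label set per positive class from a batch of token labels.

    Assumption (not confirmed by the source): within the batch derived for a
    class, tokens of every other positive class are relabeled as `negative`.
    """
    # Collect the positive classes that actually occur in this batch.
    positives = sorted(
        {label for seq in batch_labels for label in seq if label != negative}
    )
    # For each positive class, keep its labels and map everything else to `negative`.
    return {
        cls: [
            [label if label == cls else negative for label in seq]
            for seq in batch_labels
        ]
        for cls in positives
    }


# Toy batch with two positive classes, 'M' and 'N', and the negative class 'O'.
batch = [
    ["O", "M", "M", "O", "N"],
    ["N", "O", "O", "M", "O"],
]
per_class = split_batch_per_class(batch)

for cls, labels in per_class.items():
    print(cls, labels)
```

Each derived batch then poses a binary decision (one entity class versus 'O') over the same input tokens, which is the kind of contrast the Figure 5 caption describes.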