Preserving Empirical Probabilities in BERT for Small-sample Clinical Entity Recognition
Abdul Rehman, Jian Jun Zhang, Xiaosong Yang
TL;DR
This work tackles the problem of imbalanced entity labels in clinical NER with BERT. It introduces a black-box bias testing framework and shows that conventional training can amplify empirical bias, even under random labeling. To mitigate this, it proposes Binary Token Labels (BTL) combined with weighted loss and per-class batching. On the MACCROBAT dataset, this approach improves unweighted accuracy and per-entity F1 with KNN calibration while preserving overall performance. It offers a practical route to more equitable recognition of minority clinical entities and highlights trade-offs across entity types. The findings bear on deploying clinical NER systems in settings with skewed label distributions, where rare but clinically important terms must be extracted reliably.
Abstract
Named Entity Recognition (NER) encounters the challenge of unbalanced labels, where certain entity types are overrepresented while others are underrepresented in real-world datasets. This imbalance can lead to biased models that perform poorly on minority entity classes, impeding accurate and equitable entity recognition. This paper explores the effects of unbalanced entity labels on BERT-based pre-trained models. We analyze the different mechanisms of loss calculation and loss propagation for the task of token classification on randomized datasets. We then propose ways to improve token classification for the highly imbalanced task of clinical entity recognition.
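To make the idea of weighting the loss against label imbalance concrete, the following is a minimal sketch of a class-weighted cross-entropy over token labels. It is an illustration only, not the paper's implementation: the inverse-frequency weighting scheme, the weighted-mean normalization, the function names, and the toy labels are all assumptions chosen for clarity.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by N / (K * count), so rarer classes get larger weights.

    Assumption: this particular weighting scheme is illustrative only.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class) over tokens.

    probs: one dict per token, mapping class -> predicted probability.
    """
    num = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den


# Hypothetical toy data: "O" dominates, "Drug" is a rare entity label.
labels = ["O"] * 8 + ["Drug"] * 2
weights = inverse_frequency_weights(labels)

# A model that mirrors the majority prior: always 0.9 for "O", 0.1 for "Drug".
probs = [{"O": 0.9, "Drug": 0.1} for _ in labels]

unweighted = weighted_cross_entropy(probs, labels, {"O": 1.0, "Drug": 1.0})
weighted = weighted_cross_entropy(probs, labels, weights)
print(weights, unweighted, weighted)
```

In this toy setting the weighted loss is larger than the unweighted one, because errors on the rare "Drug" tokens count as much in total as the correct predictions on the frequent "O" tokens. A model that simply reproduces the empirical label prior is therefore penalized more heavily.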
