Examining Imbalance Effects on Performance and Demographic Fairness of Clinical Language Models

Precious Jones, Weisi Liu, I-Chan Huang, Xiaolei Huang

TL;DR

The paper addresses the problem of data imbalance in clinical language models for ICD-10 code prediction, examining how label and demographic imbalances affect performance and demographic fairness. It analyzes the MIMIC-IV dataset using three state-of-the-art models (ClinicalBERT, GatorTron, Clinical Longformer) across subgroups defined by gender, race/ethnicity, age, and insurance, applying multiple performance metrics and equality-based fairness analyses. The results show that data imbalance significantly impacts both performance and fairness, but the similarity of subgroup features to the majority class often governs performance more than raw representation, with macro metrics offering more stable insights under imbalance. These findings highlight the need for subgroup-aware evaluation and training strategies to develop equitable and robust clinical LLMs for ICD coding and related healthcare NLP tasks.

Abstract

Data imbalance is a fundamental challenge in applying language models to biomedical applications, particularly in ICD code prediction tasks where label and demographic distributions are uneven. While state-of-the-art language models have been increasingly adopted in biomedical tasks, few studies have systematically examined how data imbalance affects model performance and fairness across demographic groups. This study fills the gap by statistically probing the relationship between data imbalance and model performance in ICD code prediction. We analyze imbalances in a standard benchmark dataset across gender, age, ethnicity, and social determinants of health, using state-of-the-art biomedical language models. By deploying diverse performance metrics and statistical analyses, we explore the influence of data imbalance on performance variations and demographic fairness. Our study shows that data imbalance significantly impacts model performance and fairness, but feature similarity to the majority class may be a more critical factor. We believe this study provides valuable insights for developing more equitable and robust language models in healthcare applications.

Paper Structure

This paper contains 38 sections, 4 equations, 2 figures, 10 tables.

Figures (2)

  • Figure 1: Overview of label distribution (ICD-10 codes) by ethnicity group. The codes are arranged in descending order of frequency based on the overall data.
  • Figure 2: Cosine distances of label vectors between Insurance-Ethnicity.
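
The cosine distance named in the Figure 2 caption can be illustrated with a small sketch. This excerpt does not say how the paper builds its label vectors, so the code below assumes each subgroup is summarized by the relative frequency of each ICD-10 code among its records. The function names, the subgroup names, and the toy records are all hypothetical.

```python
import math
from collections import Counter


def label_vector(records, codes):
    """Relative frequency of each code within one subgroup.

    Assumption: a subgroup's label vector is its code-frequency profile;
    the paper's exact construction is not given in this excerpt.
    """
    counts = Counter(code for rec in records for code in rec)
    total = sum(counts[c] for c in codes)
    return [counts[c] / total if total else 0.0 for c in codes]


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Toy data only: each record is the list of codes assigned to one admission.
codes = ["I10", "E11.9", "N17.9"]
subgroups = {
    "group_a": [["I10", "E11.9"], ["I10"], ["I10", "N17.9"]],
    "group_b": [["N17.9"], ["N17.9", "E11.9"], ["I10"]],
}

vectors = {name: label_vector(recs, codes) for name, recs in subgroups.items()}
distance = cosine_distance(vectors["group_a"], vectors["group_b"])
print(f"cosine distance between group_a and group_b: {distance:.3f}")
```

A distance near 0 means the two subgroups have similar code profiles, and a larger distance means their label distributions diverge. Because cosine distance ignores vector length, it compares the shape of the distributions rather than the subgroup sizes.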