Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias

Shan Chen, Jack Gallifant, Mingye Gao, Pedro Moreira, Nikolaj Munch, Ajay Muthukkumar, Arvind Rajan, Jaya Kolluri, Amelia Fiske, Janna Hastings, Hugo Aerts, Brian Anthony, Leo Anthony Celi, William G. La Cava, Danielle S. Bitterman

TL;DR

Cross-Care introduces a benchmark and workflow to quantify how pretraining data biases shape LLM representations of disease prevalence across US demographic subgroups. It computes co-occurrence-based representations from The Pile and evaluates model-derived demographic rankings through average logits over 10 templates, comparing these against real-world prevalence data across multiple languages and alignment methods. The study finds substantial misalignment between model-derived prevalence rankings and real epidemiology, with Kendall's tau near zero for many comparisons and limited mitigation from standard alignment techniques, indicating weak grounding in real-world medical knowledge. A public web portal at crosscare.net provides access to counts, logits, and associations to support ongoing interpretability, robustness, and fairness research in healthcare NLP.
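
The evaluation summarized above (average logits over templates, rank the demographic subgroups, compare with real prevalence via Kendall's tau) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The subgroup names, logit values, and prevalence figures are hypothetical, only 3 templates are used instead of 10 for brevity, and the tau-a variant without tie correction is an assumption, since the summary does not say which variant the study uses.

```python
from itertools import combinations
from statistics import mean


def average_logits(template_logits):
    """Average each subgroup's logit across all templates."""
    groups = template_logits[0].keys()
    return {g: mean(t[g] for t in template_logits) for g in groups}


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-template logits for one disease (one dict per template).
template_logits = [
    {"group_a": -2.0, "group_b": -3.0, "group_c": -1.0, "group_d": -4.0},
    {"group_a": -2.5, "group_b": -3.5, "group_c": -1.5, "group_d": -3.0},
    {"group_a": -1.5, "group_b": -2.5, "group_c": -0.5, "group_d": -5.0},
]

# Hypothetical real-world prevalence for the same disease and subgroups.
prevalence = {"group_a": 0.04, "group_b": 0.10, "group_c": 0.02, "group_d": 0.07}

avg = average_logits(template_logits)
groups = sorted(avg)

# Model-derived ranking: highest average logit first.
model_ranking = sorted(groups, key=lambda g: avg[g], reverse=True)

# Agreement between model scores and real prevalence over the same subgroups.
tau = kendall_tau([avg[g] for g in groups], [prevalence[g] for g in groups])

print(model_ranking)
print(round(tau, 3))
```

A tau near 1 would mean the model orders subgroups the way the prevalence data does, a tau near -1 the reverse order, and a tau near zero no consistent relationship, which is the pattern the study reports for many comparisons.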

Abstract

Large language models (LLMs) are increasingly essential in processing natural language, yet their application is frequently compromised by biases and inaccuracies originating in their training data. In this study, we introduce Cross-Care, the first benchmark framework dedicated to assessing biases and real-world knowledge in LLMs, specifically focusing on the representation of disease prevalence across diverse demographic groups. We systematically evaluate how demographic biases embedded in pre-training corpora like $ThePile$ influence the outputs of LLMs. We expose and quantify discrepancies by juxtaposing these biases against actual disease prevalence in various U.S. demographic groups. Our results highlight substantial misalignment between LLM representations of disease prevalence and real disease prevalence rates across demographic subgroups, indicating a pronounced risk of bias propagation and a lack of real-world grounding for medical applications of LLMs. Furthermore, we observe that various alignment methods do little to resolve inconsistencies in the models' representation of disease prevalence across different languages. For further exploration and analysis, we make all data and a data visualization tool available at: www.crosscare.net.

Paper Structure (57 sections, 4 equations, 22 figures, 8 tables)

Figures (22)

  • Figure 1: Overall workflow of Cross-Care. Our detailed multilingual templates for assessing disease prevalence among different demographic subgroups can be found in Appendix \ref{templates}, Table \ref{tab:templates}.
  • Figure 2: Comparison of disease rankings between $ThePile$, Llama3's logits and real-world data. (1: most prevalent, 5: least prevalent)
  • Figure 3: a) Top-ranked gender (top) and race/ethnicity (bottom) subgroups across 89 diseases for the suite of Pythia and Mamba models, according to logit results (stacked bars). The black line (co-occurrence and logit rank match) shows the number of diseases for which the top-ranked demographic subgroup is the same when calculated using co-occurrences and logits. Demographic subgroups that never appeared as the top-ranked group are not shown. b) Kendall's tau of Mamba and Pythia logits versus co-occurrence and real prevalence for gender (top) and race/ethnicity (bottom).
  • Figure 4: Top-ranked gender and race/ethnicity subgroups across each of the 89 diseases and different alignment methods for Llama2 models, according to logit results (stacked bars).
  • Figure 5: Mean frequency of agreement for each model's highest ranking racial demographic group across all diseases. Maximum possible value = 10. Error bars are Standard Error values across the unique number of diseases.
  • ...and 17 more figures