BiasICL: In-Context Learning and Demographic Biases of Vision Language Models

Sonnet Xu, Joseph Janizek, Yixing Jiang, Roxana Daneshjou

TL;DR

This work investigates how in-context learning (ICL) prompts shape demographic fairness in vision-language models (VLMs) applied to medical imaging. Using CheXpert chest radiographs and the DDI skin-lesion dataset annotated with Fitzpatrick skin types (FST), the authors test three API-based VLMs under varied demonstration prompts and assess bias via three measures tied to prompt base rates. They show that VLMs display a majority label bias and a demographic group majority label bias, and that ICL can amplify disparities between demographic subgroups even when base rates are balanced. The findings inform practical prompting guidelines and highlight the need for deeper theoretical understanding and careful subgroup evaluation when deploying VLMs in clinical contexts.

Abstract

Vision language models (VLMs) show promise in medical diagnosis, but their performance across demographic subgroups when using in-context learning (ICL) remains poorly understood. We examine how the demographic composition of demonstration examples affects VLM performance in two medical imaging tasks: skin lesion malignancy prediction and pneumothorax detection from chest radiographs. Our analysis reveals that ICL influences model predictions through multiple mechanisms: (1) ICL allows VLMs to learn subgroup-specific disease base rates from prompts and (2) ICL leads VLMs to make predictions that perform differently across demographic groups, even after controlling for subgroup-specific disease base rates. Our empirical results inform best practices for prompting current VLMs (specifically, examining demographic subgroup performance and matching base rates of labels to the target distribution at a bulk level and within subgroups), while also suggesting next steps for improving our theoretical understanding of these models.
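
The recommendation to match label base rates within subgroups can be illustrated with a small sampling routine. This is a minimal sketch and not the authors' code: the function name `sample_demos`, the dictionary keys `"group"` and `"label"`, and the parameters `n_per_group` and `base_rates` are all hypothetical names chosen for illustration.

```python
import random
from collections import defaultdict


def sample_demos(pool, n_per_group, base_rates, seed=0):
    """Draw ICL demonstrations with a chosen positive-label rate per subgroup.

    pool        -- list of dicts with hypothetical keys "group" and "label"
                   (label 1 = positive, 0 = negative)
    n_per_group -- number of demonstrations drawn from each subgroup
    base_rates  -- dict mapping subgroup name -> target fraction of positives
    seed        -- controls both which examples are drawn and their order
    """
    rng = random.Random(seed)

    # Index the candidate pool by (subgroup, label) so each cell can be sampled.
    by_cell = defaultdict(list)
    for example in pool:
        by_cell[(example["group"], example["label"])].append(example)

    demos = []
    for group, rate in base_rates.items():
        n_pos = round(n_per_group * rate)
        n_neg = n_per_group - n_pos
        # random.Random.sample raises ValueError if a cell is too small.
        demos += rng.sample(by_cell[(group, 1)], n_pos)
        demos += rng.sample(by_cell[(group, 0)], n_neg)

    # Shuffle so that prompt order does not encode subgroup or label.
    rng.shuffle(demos)
    return demos
```

Setting every entry of `base_rates` to the same value gives a prompt in which the subgroups share a base rate, while unequal values give a prompt in which they differ. Changing `seed` varies both the selected demonstrations and their ordering.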

Paper Structure

This paper contains 11 sections and 5 figures.

Figures (5)

  • Figure 1: Overview. CheXpert and DDI (a) were used to investigate several biases, including: (b) majority label bias, or the tendency of models to predict a label more frequently when it is more prevalent in the prompt; (c) a new bias introduced in our paper called group majority label bias, or the tendency of models to be swayed by the majority label seen using ICL within a particular demographic subgroup when encountering test examples from that same subgroup; and (d) ICL bias, or the extent to which models learn disparities between groups as the number of demos in a prompt increases. In (b-d), orange and blue bars represent different demographic groups, and the height of each bar represents the fraction of positive labels in the prompt within that subgroup.
  • Figure 2: Majority label bias -- models more frequently predict labels that are more frequent in the prompt. (a-c) Prediction of malignancy on the DDI dataset. (d-f) Prediction of pneumothorax on the CheXpert dataset. Error bars = standard error over three independent runs with different random seeds for selection of demonstration examples from the dataset and ordering of demonstrations in the prompts.
  • Figure 3: Demographic group majority label bias. (a-c) Malignancy prediction on DDI dataset; (e-g) Pneumothorax prediction on CheXpert. Error bars = standard error over three independent runs with different random seeds for demonstration selection and prompt ordering. (d) 0-shot accuracy for patient Fitzpatrick skin type prediction from dermatology images; (h) maximum 0-to-50-shot accuracy for patient sex prediction from chest radiographs.
  • Figure 4: The impact of ICL on the difference between models' average predictions across subgroups when the base rate of positive labels is set equal between subgroups in the prompt.
  • Figure 5: The impact of ICL on GPT-4o's predictive performance on the DDI dataset when (a) adding only FST V/VI demos, (b) adding only FST I/II demos, and (c) adding equal numbers of both.
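
Two quantities named in the captions above can be sketched in code: the difference between a model's average predictions across subgroups (Figure 4) and the standard error over independent runs (Figures 2 and 3). This is a minimal sketch under stated assumptions, not the authors' implementation. The function names are hypothetical, predictions are assumed to be binary (1 = positive), and the standard error is assumed to use the sample standard deviation, which the source does not specify.

```python
import math
import statistics


def positive_rate(preds):
    """Fraction of binary predictions that are positive."""
    return sum(preds) / len(preds)


def subgroup_gap(preds, groups, group_a, group_b):
    """Average prediction in group_a minus average prediction in group_b."""
    in_a = [p for p, g in zip(preds, groups) if g == group_a]
    in_b = [p for p, g in zip(preds, groups) if g == group_b]
    return positive_rate(in_a) - positive_rate(in_b)


def mean_and_standard_error(values):
    """Mean and standard error of a metric measured once per run.

    Assumes the sample standard deviation (n - 1 denominator);
    needs at least two runs.
    """
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error
```

For example, calling `subgroup_gap` once per random seed and passing the three resulting gaps to `mean_and_standard_error` yields one point with an error bar.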