FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants

Mahesh Bhosale, Abdul Wasi, Shantam Srivastava, Shifa Latif, Tianyu Luan, Mingchen Gao, David Doermann, Xuan Gong

Abstract

While powerful in image-conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between the model's internal representations and demographic attributes, FairLLaVA regularizes those representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, maintaining efficiency with low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter-group disparities while improving both equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities. Code can be accessed at https://github.com/bhosalems/FairLLaVA.

Paper Structure

This paper contains 29 sections, 11 equations, 9 figures, 14 tables, 1 algorithm.

Figures (9)

  • Figure 1: FairLLaVA reduces performance disparities. LLaVA hidden states contain demographic shortcuts (non-zero Mutual Information (MI) between hidden states and demographic attributes) that lead to lower performance for "Female". FairLLaVA minimizes this MI, promoting demographic-invariant representation learning and thereby reducing the performance gap.
  • Figure 2: FairLLaVA Overview. Stage 1: We fine-tune the multi-modal projector $\psi$ to align image embeddings with the Language Model (LM) by optimizing the standard LM cross-entropy loss $\mathcal{L}_{LM}$; the image encoder and LM are frozen. Stage 2: We learn attribute-invariant representations by fine-tuning LoRA adapters $\theta$ on the LM's Transformer decoder blocks while freezing the LM backbone and image encoder. Pooled hidden states $h$ are fed to a mutual-information (MI) estimator with a variational demographic-attribute classifier (DAC), denoted $\phi$, that predicts the demographic attribute from $h$. No pretrained classifier is required: the DAC is trained simultaneously with cross-entropy $\mathcal{L}_{\text{DAC}}$ (Eq. \ref{eq:l_DAC}), and during this step gradients update only $\phi$. We then minimize the MI between the demographic attributes $\mathbf{a}$ and $h$ via $\mathcal{L}_{\text{DIM}}$ (Eq. \ref{eq:club_batch}), computed between pooled states $h$ and attributes $\mathbf{a}$; here $\phi$ is frozen, and only $\theta$ and $\psi$ are updated. The DAC thus exposes where leakage of $\mathbf{a}$ into $h$ comes from, and $\mathcal{L}_{\text{DIM}}$ suppresses this leakage, making the learned features invariant to demographic shortcuts.
  • Figure S1: Hyperparameter Sensitivity. (a) Varying the contribution of each attribute-specific MI term to the total loss on MIMIC-CXR leads to only minor changes, indicating stable overall performance across attributes. (b) Varying the contribution of the language model loss $\mathcal{L}_{LM}$ likewise leads to only minor changes in overall performance.
  • Figure S2: 95% Confidence Intervals of ES metric on MIMIC-CXR with bootstrap resampling ($n{=}1000$)
  • Figure S3: Subgroup performance of baselines across "Race", "Age", and "Gender" subgroups on the MIMIC-CXR dataset, measured by the GREEN and BLEU-4 metrics. We observe that a higher sample count in the training dataset does not correlate with increased performance. Please also see Fig. \ref{fig:dem_dist}.
  • ...and 4 more figures
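The two-stage procedure described in the Figure 2 caption hinges on a variational (CLUB-style) upper bound on the MI between pooled hidden states $h$ and demographic attributes $\mathbf{a}$, estimated per batch with the frozen DAC. The sketch below illustrates the idea only: it assumes a simple linear-softmax classifier as a stand-in for the DAC $\phi$ (the paper's actual DAC architecture, pooling, and batch estimator in Eq. \ref{eq:club_batch} may differ), with the positive term scoring matched $(h_i, a_i)$ pairs and the negative term scoring all cross pairs as a marginal.

```python
import numpy as np

def softmax(z):
    """Numerically stable row-wise softmax."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def club_mi_upper_bound(h, a, W, b):
    """CLUB-style batch MI upper bound between pooled states h (N, d)
    and discrete attributes a (N,), using a frozen linear-softmax
    classifier q_phi(a | h) = softmax(h @ W + b) as the variational DAC.

    bound = E_i[log q(a_i | h_i)]  -  E_{i,j}[log q(a_j | h_i)]
    (matched pairs)                   (all pairs, marginal approximation)
    """
    logp = np.log(softmax(h @ W + b) + 1e-12)      # (N, K) log-probs
    n = len(a)
    positive = logp[np.arange(n), a].mean()        # matched (h_i, a_i)
    negative = logp[:, a].mean()                   # all (h_i, a_j) pairs
    return positive - negative

# Toy check: states perfectly predictive of a binary attribute yield a
# large bound; the same states with attribute-independent labels yield ~0.
h = np.array([[5., 0.], [5., 0.], [0., 5.], [0., 5.]])
W, b = np.eye(2), np.zeros(2)
print(club_mi_upper_bound(h, np.array([0, 0, 1, 1]), W, b))  # dependent: > 0
print(club_mi_upper_bound(h, np.array([0, 1, 0, 1]), W, b))  # independent: ~ 0
```

In FairLLaVA's Stage 2, this quantity would be minimized with $\phi$ (here `W`, `b`) frozen, so gradients flow only into the representations that produce $h$, pushing the bound, and hence the attribute leakage, toward zero.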