FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks

Peiran Wu; Che Liu; Canyu Chen; Jun Li; Cosmin I. Bercea; Rossella Arcucci

TL;DR

This work proposes FMBench, the first benchmark designed to evaluate the fairness of MLLM performance across diverse demographic attributes, and introduces a new metric, Fairness-Aware Performance (FAP), to quantify how fairly MLLMs perform across those attributes.

Abstract

Advancements in Multimodal Large Language Models (MLLMs) have significantly improved performance on medical tasks such as Visual Question Answering (VQA) and Report Generation (RG). However, the fairness of these models across diverse demographic groups remains underexplored, despite its importance in healthcare. This oversight is partly due to the lack of demographic diversity in existing medical multimodal datasets, which complicates the evaluation of fairness. In response, we propose FMBench, the first benchmark designed to evaluate the fairness of MLLM performance across diverse demographic attributes. FMBench has the following key features: (1) It includes four demographic attributes (race, ethnicity, language, and gender) across two tasks, VQA and RG, under zero-shot settings. (2) Our VQA task is free-form, enhancing real-world applicability and mitigating the biases associated with predefined choices. (3) We use both lexical metrics and LLM-based metrics, aligned with clinical evaluations, to assess models not only for linguistic accuracy but also from a clinical perspective. Furthermore, we introduce a new metric, Fairness-Aware Performance (FAP), to evaluate how fairly MLLMs perform across various demographic attributes. We thoroughly evaluate the performance and fairness of eight state-of-the-art open-source MLLMs, including both general and medical MLLMs, ranging from 7B to 26B parameters, on the proposed benchmark. We aim for FMBench to assist the research community in refining model evaluation and driving future advancements in the field. All data and code will be released upon acceptance.

Paper Structure (20 sections, 3 equations, 8 figures, 7 tables)

Introduction
Related Work
FMBench
Data Source
QA Pair Generation and Optimization
Fairness-Aware Performance
Experiments Configuration
Evaluated Models
Zero-shot Evaluation
Evaluation Metrics
Results
Benchmarking MLLM Performance
Benchmarking MLLM Fairness
Conclusion
Dataset Details
...and 5 more sections

Figures (8)

Figure 1: Overview of the FMBench QA pair construction. (a) Two sample entries from the FMBench dataset, derived from the Harvard-FairVLMed dataset. Each entry pairs a fundus image with a clinical report and detailed demographic data. (b) LLM-based generation of QA pairs using Llama-3.1-70B-Instruct. The LLM is queried with clinical reports to produce QA pairs categorized into primary condition or diagnosis, testing or treatment, and medical condition. (c) Inference on the QA pairs for the VQA task and for medical report generation.
Figure 2: The prompt for LLM scoring. Lexical metrics fall short in evaluating the semantic correctness of VQA and report generation outputs. To overcome this limitation, we directly query an LLM, Llama-3.1-70B-Instruct, to score the generated results.
Figure 3: Performance of MLLMs averaged across all demographic attributes. The dashed line shows the relationship between the GREEN and LLM scores. Top Left: Average of 9 lexical scores and demographics on the zero-shot VQA task. Top Right: Average of 9 lexical scores and demographics on the zero-shot report generation task. Bottom Left: Correlation between GREEN and LLM scores on the zero-shot VQA task. Bottom Right: Correlation between GREEN and LLM scores on the zero-shot report generation task.
Figure 4: We provide four samples from LLaVA-Med inference results. Samples 1-3: the ground truth answers and the predicted answers are highly consistent semantically. Samples 4-6: the ground truth answers and the predicted answers are consistent at the word level but differ in semantics.
Figure 5: GREEN scores for 8 MLLMs across different demographic groups. Top: GREEN scores for the zero-shot VQA task. Bottom: GREEN scores for the zero-shot report generation task.
...and 3 more figures
