DUCX: Decomposing Unfairness in Tool-Using Chest X-ray Agents

Zikang Xu, Ruinan Jin, Xiaoxiao Li

TL;DR

A stage-wise fairness decomposition is introduced that separates end-to-end bias from three agent-specific sources: tool exposure bias (utility gaps conditioned on tool presence), tool transition bias (subgroup differences in tool-routing patterns), and model reasoning bias (subgroup differences in synthesis behaviors).

Abstract

Tool-using medical agents can improve chest X-ray question answering by orchestrating specialized vision and language modules, but this added pipeline complexity also creates new pathways for demographic bias beyond those of standalone models. We present DUCX (Decomposing Unfairness in Chest X-ray agents), a systematic audit of chest X-ray agents instantiated with MedRAX. To localize where disparities arise, we introduce a stage-wise fairness decomposition that separates end-to-end bias from three agent-specific sources: tool exposure bias (utility gaps conditioned on tool presence), tool transition bias (subgroup differences in tool-routing patterns), and model reasoning bias (subgroup differences in synthesis behaviors). Extensive experiments on tool-using agentic frameworks with five driver backbones reveal that (i) demographic gaps persist in end-to-end performance, with equalized odds of up to 20.79% and a fairness-utility tradeoff as low as 28.65%, and (ii) intermediate behaviors (tool usage, transition patterns, and reasoning traces) exhibit distinct subgroup disparities that are not predictable from end-to-end evaluation alone (e.g., conditioned on segmentation-tool availability, the subgroup utility gap reaches as high as 50%). Our findings underscore the need for process-level fairness auditing and debiasing to ensure the equitable deployment of clinical agentic systems. Code is available here: https://anonymous.4open.science/r/DUCK-E5FE/README.md
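To make "utility gaps conditioned on tool presence" concrete, here is a minimal sketch of how such a gap could be computed. It assumes that utility is plain accuracy, that the gap is the signed accuracy difference between two subgroups over episodes in which a given tool was invoked, and that each episode is logged as a `(group, tools_used, correct)` tuple. The function name, the record layout, and the toy data are all hypothetical; the paper's exact definition may differ.

```python
from collections import defaultdict

def tool_exposure_gap(records, tool, group_a, group_b):
    """Signed accuracy gap (group_a minus group_b) over episodes that used `tool`.

    `records` is an iterable of (group, tools_used, correct) tuples.
    This layout is an assumption for illustration, not the paper's format.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, tools_used, correct in records:
        if tool in tools_used:  # condition on tool presence
            totals[group] += 1
            hits[group] += int(correct)
    if totals[group_a] == 0 or totals[group_b] == 0:
        return None  # gap is undefined if a subgroup never triggered the tool
    return hits[group_a] / totals[group_a] - hits[group_b] / totals[group_b]

# Toy episodes (made up for illustration, not results from the paper).
records = [
    ("male", {"SEG", "CLS"}, True),
    ("male", {"SEG"}, True),
    ("male", {"SEG"}, True),
    ("male", {"SEG", "RG"}, False),
    ("female", {"SEG"}, True),
    ("female", {"SEG", "VIS"}, False),
    ("female", {"CLS"}, True),
]

seg_gap = tool_exposure_gap(records, "SEG", "male", "female")
cls_gap = tool_exposure_gap(records, "CLS", "male", "female")
rg_gap = tool_exposure_gap(records, "RG", "male", "female")
print(seg_gap, cls_gap, rg_gap)
```

On this toy data, the male subgroup is more accurate when the segmentation tool is used, while the classifier shows no gap. This illustrates the abstract's point that a per-tool view can expose disparities that a single end-to-end number would average away.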

Paper Structure (8 sections, 5 equations, 3 figures, 2 tables)

This paper contains 8 sections, 5 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of DUCX. (1) Dataset curation: we use CheXAgentBench and curate MIMIC-FairnessVQA into a standardized form. (2) MedRAX agent execution (ReAct): a driver LLM iteratively reasons, selects tools, and synthesizes a final answer; we highlight the points at which bias can be introduced (tool exposure, tool transitions, and final synthesis). (3) Fairness decomposition: we evaluate end-to-end fairness (ACC, $\Delta$ACC, DP, EoD, FUT) and decompose it into tool exposure bias (gaps conditioned on tool presence), tool transition bias (gaps in tool routing), and LLM reasoning bias (gaps in synthesis behaviors).
  • Figure 2: Tool exposure bias across tools and subgroups. (a) Gender on CheXAgentBench, (b) Age on CheXAgentBench, (c) Gender on MIMIC-FairnessVQA, and (d) Age on MIMIC-FairnessVQA.
  • Figure 3: Tool transition biases represented as matrices. START: start indicator; CLS: classifier; RG: report generator; SEG: segmentator; VIS: visualizer; GRD: phrase grounding. Cells in red denote values larger than zero, while cells in blue denote values smaller than zero. $\Delta_{\text{TTB}}^{\text{gender}} = P^{\text{male}} - P^{\text{female}}$, $\Delta_{\text{TTB}}^{\text{age}} = P^{\text{young}} - P^{\text{old}}$. All subfigures present results averaged across different LLMs.
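
The transition-bias matrices in Figure 3 can be illustrated with a short sketch. It assumes that $P$ is a row-normalized first-order transition matrix estimated from each subgroup's tool-call traces, with every trace prefixed by START; the caption does not say how $P$ is normalized, so this is only one plausible reading. The function names and toy traces are hypothetical.

```python
from collections import Counter

# State abbreviations follow the Figure 3 caption.
STATES = ["START", "CLS", "RG", "SEG", "VIS", "GRD"]

def transition_matrix(traces):
    """Row-normalized transition frequencies P[src][dst] over tool traces.

    Assumption: each trace is a list of tool abbreviations in call order,
    and rows with no outgoing transitions are left as all zeros.
    """
    counts = Counter()
    for trace in traces:
        path = ["START"] + list(trace)
        for src, dst in zip(path, path[1:]):
            counts[(src, dst)] += 1
    matrix = {}
    for src in STATES:
        row_total = sum(counts[(src, dst)] for dst in STATES)
        matrix[src] = {
            dst: (counts[(src, dst)] / row_total if row_total else 0.0)
            for dst in STATES
        }
    return matrix

def transition_bias(traces_a, traces_b):
    """Element-wise difference P^a - P^b, e.g. P^male - P^female."""
    p_a = transition_matrix(traces_a)
    p_b = transition_matrix(traces_b)
    return {s: {d: p_a[s][d] - p_b[s][d] for d in STATES} for s in STATES}

# Toy traces (made up for illustration, not data from the paper).
male_traces = [["CLS", "SEG"], ["CLS", "RG"]]
female_traces = [["RG"], ["CLS", "RG"]]

delta = transition_bias(male_traces, female_traces)
print(delta["START"]["CLS"], delta["CLS"]["RG"])
```

A positive cell (red in the figure) means the first subgroup takes that transition more often, and a negative cell (blue) means the second subgroup does. Averaging across driver LLMs, as the caption describes, would amount to averaging these difference matrices element-wise.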