Ablation Study of a Fairness Auditing Agentic System for Bias Mitigation in Early-Onset Colorectal Cancer Detection

Amalia Ionescu, Jose Guadalupe Hernandez, Jui-Hsuan Chang, Emily F. Wong, Paul Wang, Jason H. Moore, Tiffani J. Bright

Abstract

Artificial intelligence (AI) is increasingly used in clinical settings, yet limited oversight and domain expertise can allow algorithmic bias and safety risks to persist. This study evaluates whether an agentic AI system can support auditing biomedical machine learning models for fairness in early-onset colorectal cancer (EO-CRC), a condition with documented demographic disparities. We implemented a two-agent architecture consisting of a Domain Expert Agent that synthesizes literature on EO-CRC disparities and a Fairness Consultant Agent that recommends sensitive attributes and fairness metrics for model evaluation. An ablation study compared three Ollama large language models (8B, 20B, and 120B parameters) across three configurations: pretrained LLM-only, Agent without Retrieval-Augmented Generation (RAG), and Agent with RAG. Across models, the Agent with RAG achieved the highest semantic similarity to expert-derived reference statements, particularly for disparity identification, suggesting agentic systems with retrieval may help scale fairness auditing in clinical AI.

Paper Structure (7 sections, 1 figure, 3 tables)

Figures (1)

  • Figure 1: Distribution of semantic similarity scores between the outputs generated by Agents 1 and 2 and the relevant ground truth statements under three conditions (LLM-only, Agent without RAG, and Agent with RAG). Results are shown for two agent roles, the Domain Expert Agent (Agent 1) and the Fairness Consultant Agent (Agent 2), and for three model sizes (Llama 3.1 8B, OSS 20B, and OSS 120B). Violin plots display the distribution of similarity scores with embedded boxplots and individual observations. The conditions are labeled LLM, Agent NR (Agent without RAG), and Agent R (Agent with RAG). Asterisks mark statistically significant pairwise comparisons; p-values are reported where differences are not statistically significant.
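
The semantic similarity scores summarized in Figure 1 compare generated output with ground truth statements. This excerpt does not say which embedding model or similarity measure the authors used, so the sketch below is only an illustration under a stated assumption: that each text has already been mapped to a numeric embedding vector and that similarity is measured as cosine similarity. The function names and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def score_against_references(output_vec, reference_vecs):
    """Score one output embedding against each reference-statement embedding."""
    return [cosine_similarity(output_vec, ref) for ref in reference_vecs]


# Toy, made-up embeddings (assumption: real ones would come from an embedding model).
reference_vecs = [[1.0, 0.0], [0.0, 1.0]]
output_vec = [3.0, 4.0]

scores = score_against_references(output_vec, reference_vecs)
print(scores)  # [0.6, 0.8]
```

In a setup like this, one list of scores per model and condition would form the distributions that a violin plot displays. How the authors paired outputs with reference statements is not described in this excerpt.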