Auto-Search and Refinement: An Automated Framework for Gender Bias Mitigation in Large Language Models
Yue Xu, Chengyan Fu, Li Xiong, Sibei Yang, Wenjie Wang
TL;DR
This work tackles gender bias in large language models by introducing FaIRMaker, an automated, model-agnostic framework that combines an auto-search phase for debiasing triggers (Fairwords) with a refinement stage that converts them into transferable, natural-language instructions. The approach preserves task performance on both open-source and API-based LLMs, addressing the limitations of manual debiasing prompts and parameter-tuning methods. Extensive experiments on diverse benchmarks and models show that FaIRMaker mitigates bias effectively while preserving utility, and that it can be extended with methods such as Direct Preference Optimization. The findings offer a practical path toward deploying fairer LLMs at scale, and an analysis of the diversity and emotional cues of Fairwords provides insight into how they work.
Abstract
Pre-training large language models (LLMs) on vast text corpora enhances their natural language processing capabilities but risks encoding social biases, particularly gender bias. While parameter-modification methods such as fine-tuning mitigate bias, they are resource-intensive, unsuitable for closed-source models, and lack adaptability to evolving societal norms. Instruction-based approaches offer flexibility but often compromise task performance. To address these limitations, we propose FaIRMaker, an automated and model-independent framework that employs an auto-search and refinement paradigm to adaptively generate Fairwords, which act as instructions integrated into input queries to reduce gender bias and enhance response quality. Extensive experiments demonstrate that FaIRMaker automatically searches for and dynamically refines Fairwords, effectively mitigating gender bias while preserving task integrity and remaining compatible with both API-based and open-source LLMs.
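Because Fairwords act purely at the input level, applying them needs no access to model weights, which is what makes the approach usable with API-based models. The sketch below illustrates only that inference-time step, under stated assumptions: the names `apply_fairword`, `debiased_generate`, and `EXAMPLE_FAIRWORD` are hypothetical, the instruction text is invented for illustration rather than being an actual searched-and-refined Fairword, and prepending it to the query is an assumed placement. The auto-search and refinement stages themselves are not shown.

```python
from typing import Callable

# Hypothetical example instruction. Real Fairwords are produced by FaIRMaker's
# auto-search and refinement stages, not written by hand like this one.
EXAMPLE_FAIRWORD = (
    "Answer helpfully and avoid assumptions or stereotypes based on gender."
)


def apply_fairword(query: str, fairword: str = EXAMPLE_FAIRWORD) -> str:
    """Integrate a natural-language debiasing instruction into the input query."""
    # Assumption: the instruction precedes the query, separated by a blank line.
    return f"{fairword}\n\n{query}"


def debiased_generate(
    llm: Callable[[str], str], query: str, fairword: str = EXAMPLE_FAIRWORD
) -> str:
    """Model-agnostic wrapper: only the prompt changes, never the model's parameters."""
    return llm(apply_fairword(query, fairword))


if __name__ == "__main__":
    # Stand-in "model" that echoes its prompt, so the sketch runs without any API.
    echo_llm = lambda prompt: prompt
    print(debiased_generate(echo_llm, "Describe a typical nurse."))
```

Any callable that maps a prompt string to a response string, whether a local open-source model or a client for an API-based one, can be passed as `llm`, which mirrors the model-agnostic design described above.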
