An Empirical Survey of Model Merging Algorithms for Social Bias Mitigation

Daiki Shirafuji, Tatsuhiko Saito, Yasutomo Kimura

TL;DR

The paper empirically compares seven model-merging algorithms for mitigating social bias across 13 LLMs from the GPT, LLaMA, and Qwen families, using BBQ, BOLD, and HONEST as bias benchmarks and SuperGLUE for downstream evaluation. It finds a consistent trade-off: stronger bias reduction often degrades performance on reading-comprehension and commonsense/causal reasoning tasks. Among the methods, Linear, SLERP, and Nearswap best balance bias mitigation with utility, with SLERP at moderate interpolation (α ≈ 0.2–0.3) offering the most practical compromise. The work highlights that excessive debiasing or certain merging dynamics can erode core linguistic capabilities, pointing to future work on task-aware or joint merging strategies to preserve downstream performance.

Abstract

Large language models (LLMs) are known to inherit and even amplify societal biases present in their pre-training corpora, threatening fairness and social trust. To address this issue, recent work has explored "editing" LLM parameters to mitigate social bias with model merging approaches; however, no empirical comparison of these approaches exists. In this work, we empirically survey seven algorithms: Linear, Karcher Mean, SLERP, NuSLERP, TIES, DELLA, and Nearswap, applying them to 13 open-weight models from the GPT, LLaMA, and Qwen families. We perform a comprehensive evaluation using three bias datasets (BBQ, BOLD, and HONEST) and measure the impact of these techniques on LLM performance in downstream tasks of the SuperGLUE benchmark. We find a trade-off between bias reduction and downstream performance: methods achieving greater bias mitigation degrade accuracy, particularly on tasks requiring reading comprehension and commonsense and causal reasoning. Among the merging algorithms, Linear, SLERP, and Nearswap consistently reduce bias while maintaining overall performance, with SLERP at moderate interpolation weights emerging as the most balanced choice. These results highlight the potential of model merging algorithms for bias mitigation, while indicating that excessive debiasing or inappropriate merging methods may lead to the degradation of important linguistic abilities.
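
To make the role of the interpolation weight concrete, here is a minimal sketch of two of the surveyed methods, Linear and SLERP, applied to a pair of flattened weight vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names `linear_merge` and `slerp_merge` are hypothetical, the vectors stand in for a single parameter tensor, and the convention that α = 0 reproduces the first (pre-trained) model is assumed from the Figure 2 caption.

```python
import math


def linear_merge(base, other, alpha):
    """Element-wise linear interpolation: (1 - alpha) * base + alpha * other.

    Assumed convention: alpha = 0 returns `base` (the pre-trained weights).
    """
    return [(1.0 - alpha) * b + alpha * o for b, o in zip(base, other)]


def slerp_merge(base, other, alpha, eps=1e-8):
    """Spherical linear interpolation (SLERP) between two weight vectors.

    The angle is measured between the two vectors' directions; the result
    follows the arc between them instead of the straight chord.
    """
    norm_b = math.sqrt(sum(b * b for b in base))
    norm_o = math.sqrt(sum(o * o for o in other))
    if norm_b < eps or norm_o < eps:
        # A zero vector has no direction, so fall back to linear merging.
        return linear_merge(base, other, alpha)

    cos_omega = sum(b * o for b, o in zip(base, other)) / (norm_b * norm_o)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel or anti-parallel vectors: SLERP is ill-conditioned.
        return linear_merge(base, other, alpha)

    w_base = math.sin((1.0 - alpha) * omega) / sin_omega
    w_other = math.sin(alpha * omega) / sin_omega
    return [w_base * b + w_other * o for b, o in zip(base, other)]


if __name__ == "__main__":
    # Toy two-dimensional "weights"; real use would loop over every tensor.
    pretrained = [1.0, 0.0]
    other_model = [0.0, 1.0]
    print(linear_merge(pretrained, other_model, 0.25))
    print(slerp_merge(pretrained, other_model, 0.25))
```

For two unit-norm, non-parallel inputs, SLERP keeps the merged vector at unit norm, whereas linear interpolation shrinks it toward the origin. In practice a merge would apply such a function tensor by tensor across both checkpoints.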

Paper Structure

This paper contains 33 sections, 2 equations, 18 figures, 3 tables.

Figures (18)

  • Figure 1: An overview of the social bias mitigation process based on model merging methods.
  • Figure 2: The BBQ evaluation results. Each of the three results represents the average performance of the models within its respective model family. The blue, orange, green, red, purple, brown, and pink lines correspond to the results for Linear, Karcher Mean, SLERP, NuSLERP, TIES, DELLA, and Nearswap, respectively. The scores at weight $\alpha = 0$ are those of the pre-trained LLMs.
  • Figure 3: The BOLD evaluation results. Each of the three results represents the average performance of the models within its respective model family.
  • Figure 4: The HONEST evaluation results. Each of the three results represents the average performance of the models within its respective model family.
  • Figure 5: The SuperGLUE evaluation results. Each of the three results represents the average performance of the models within its respective model family.
  • ...and 13 more figures