MBIAS: Mitigating Bias in Large Language Models While Retaining Context

Shaina Raza; Ananya Raval; Veronica Chatrath

MBIAS: Mitigating Bias in Large Language Models While Retaining Context

Shaina Raza, Ananya Raval, Veronica Chatrath

TL;DR

MBIAS tackles the challenge of mitigating bias and toxicity in large language models without eroding contextual meaning. It builds a safety-focused instruction-tuning dataset of unsafe-to-benign text pairs and trains a Mistral-7B-Instruct model using parameter-efficient fine-tuning via QLoRA, with LLMs acting as both annotator and evaluator. Empirical results show substantial bias/toxicity reductions ($>30\%$ in standard evaluations and $>90\%$ across diverse demographics) and strong knowledge retention and faithfulness, though some demographic groups still exhibit residual biases. By releasing the dataset and model weights, the work enables reproducibility and practical adoption of debiasing techniques in real-world LLM deployments while acknowledging limitations and ethical considerations for safe AI practice.

Abstract

The deployment of Large Language Models (LLMs) in diverse applications necessitates an assurance of safety without compromising the contextual integrity of the generated content. Traditional approaches, including safety-specific fine-tuning or adversarial testing, often yield safe outputs at the expense of contextual meaning. This can result in a diminished capacity to handle nuanced aspects of bias and toxicity, such as underrepresentation or negative portrayals across various demographics. To address these challenges, we introduce MBIAS, an LLM framework carefully instruction fine-tuned on a custom dataset designed specifically for safety interventions. MBIAS is designed to significantly reduce biases and toxic elements in LLM outputs while preserving the main information. This work also details our further use of LLMs: as annotator under human supervision and as evaluator of generated content. Empirical analysis reveals that MBIAS achieves a reduction in bias and toxicity by over 30\% in standard evaluations, and by more than 90\% in diverse demographic tests, highlighting the robustness of our approach. We make the dataset and the fine-tuned model available to the research community for further investigation and ensure reproducibility. The code for this project can be accessed here https://github.com/shainarazavi/MBIAS/tree/main. Warning: This paper contains examples that may be offensive or upsetting.

MBIAS: Mitigating Bias in Large Language Models While Retaining Context

TL;DR

in standard evaluations and

across diverse demographics) and strong knowledge retention and faithfulness, though some demographic groups still exhibit residual biases. By releasing the dataset and model weights, the work enables reproducibility and practical adoption of debiasing techniques in real-world LLM deployments while acknowledging limitations and ethical considerations for safe AI practice.

Abstract

Paper Structure (20 sections, 5 equations, 1 figure, 5 tables)

This paper contains 20 sections, 5 equations, 1 figure, 5 tables.

Introduction
Related Works
Method
Dataset Preparation
Model Training
Efficient Fine-Tuning with QLoRA
Experiments
Experimental Setting
Evaluation Data, Metrics, and Baselines
Evaluation Data
Evaluation Metrics
Baselines
Results
Overall Results
Performance of MBIAS across Different Demographics
...and 5 more sections

Figures (1)

Figure 1: MBIAS architecture showing data preparation and model training with parameter efficient fine tuning.

MBIAS: Mitigating Bias in Large Language Models While Retaining Context

TL;DR

Abstract

MBIAS: Mitigating Bias in Large Language Models While Retaining Context

Authors

TL;DR

Abstract

Table of Contents

Figures (1)