Developing Safe and Responsible Large Language Model : Can We Balance Bias Reduction and Language Understanding in Large Language Models?

Shaina Raza; Oluwanifemi Bamgbose; Shardul Ghuge; Fatemeh Tavakol; Deepak John Reji; Syed Raza Bashir

TL;DR

This work addresses the tension between bias reduction and knowledge retention in large language models by introducing SR$_{\text{LLM}}$, an instruction-finetuned, decoder-only LLM built on safety-guarded bases like Llama. It leverages a large, curated Content Moderation Dataset (CMD) of unsafe-safe paired examples and employs parameter-efficient fine-tuning (QLoRA and prefix-tuning) to enable scalable deployment. Across in-house and out-of-distribution benchmarks (Toxigen, BOLD, StereoSet) and a case study on job postings, SR$_{\text{LLM}}$ demonstrates reduced bias and toxicity while preserving language understanding (LMS, ICAT) and content quality, outperforming smaller supervised baselines and prompting-only baselines. The results suggest that targeted instruction fine-tuning on curated debiasing data can effectively minimize harmful biases without sacrificing model capabilities, with practical implications for safe AI deployment in high-stakes domains.
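The LMS and ICAT scores named above come from the StereoSet benchmark. As a rough illustration of how the two relate, the sketch below uses the standard StereoSet definition, ICAT = LMS × min(SS, 100 − SS) / 50, where SS is the stereotype score. The function name `icat` and the sample values are hypothetical, and this excerpt does not show the exact computation used in the paper.

```python
def icat(lms: float, ss: float) -> float:
    """Idealized CAT score under the standard StereoSet definition.

    lms: language modeling score, as a percentage in [0, 100]
    ss:  stereotype score, as a percentage in [0, 100]; 50 means no preference
         between stereotypical and anti-stereotypical continuations
    """
    if not (0.0 <= lms <= 100.0 and 0.0 <= ss <= 100.0):
        raise ValueError("lms and ss must be percentages in [0, 100]")
    # The score is highest when the model is fluent (high lms) and
    # unbiased (ss close to 50); it falls to 0 when ss is 0 or 100.
    return lms * min(ss, 100.0 - ss) / 50.0


if __name__ == "__main__":
    # Hypothetical values, not results from the paper.
    print(icat(90.0, 60.0))
```

Under this definition, a model can only score well by keeping LMS high while moving SS toward 50, which is the balance between language understanding and bias reduction that the TL;DR describes.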

Abstract

Large Language Models (LLMs) have advanced various Natural Language Processing (NLP) tasks, such as text generation and translation. However, these models often generate texts that can perpetuate biases. Existing approaches to mitigate these biases usually compromise knowledge retention. This study explores whether LLMs can produce safe, unbiased outputs without sacrificing knowledge or comprehension. We introduce the Safe and Responsible Large Language Model (\textbf{SR}$_{\text{LLM}}$), which has been instruction fine-tuned on top of a safe fine-tuned auto-regressive decoder-only LLM to reduce biases in generated texts. We developed a specialized dataset with examples of unsafe text and corresponding safe variations to train \textbf{SR}$_{\text{LLM}}$ to identify and correct biased text. Experiments on our specialized dataset and out-of-distribution test sets reveal that \textbf{SR}$_{\text{LLM}}$ effectively reduces biases while preserving knowledge integrity. This performance surpasses that of traditional fine-tuning of smaller language models and of base LLMs that merely rely on prompting techniques. Our findings demonstrate that instruction fine-tuning on custom datasets tailored for tasks such as debiasing is a highly effective strategy for minimizing bias in LLMs while preserving their inherent knowledge and capabilities. The code and dataset are accessible at \href{https://github.com/shainarazavi/Safe-Responsible-LLM}{SR-LLM}.

Paper Structure (55 sections, 2 equations, 6 figures, 12 tables)

Introduction
Research questions
Research objectives
Contributions
Methodology
Preliminaries
Data preparation
Model
Training objective
Instruction design
Efficient parameterization for scalable fine-tuning
Experimental Setup
Training details and hyperparameters
Evaluation data
Baselines
...and 40 more sections

Figures (6)

Figure 1: Framework for SR$_{\text{LLM}}$, showing an end-to-end process. It starts with the content moderation dataset preparation, where original texts are annotated with labels (bias, toxicity, harm and sentiment), in particular the human annotated gold label "benign text" generation. The instruction dataset is then utilized for instruction fine-tuning during the training phase. The model is evaluated for accuracy, fairness, content diversity and text styles, and knowledge retention. The merged model weights result in a model capable of generating benign variations of unsafe content.
Figure 2: Format for Instruction.
Figure 3: One-Sample t-Test Result for Safety Measures. This graph shows the t-distribution after safety interventions on 16,602 examples. The black dashed line shows the mean (20.19), and the green solid line marks the observed t-value (28.17). Red dashed lines and shaded areas indicate critical t-value thresholds and regions for rejecting the null hypothesis.
Figure 4: Comparison of Stylistic Traits Before and After Safety Intervention
Figure 5: Safety vs. Language Understanding Scores. Presented are percentages, reflecting averages from 100 samples for each model variant. Safe-PEFT-1_ep, our current setting for SR$_{\text{LLM}}$, shows the highest language understanding and safe text generation.
...and 1 more figure
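The Figure 3 caption refers to a one-sample t-test. As a minimal sketch of how such a statistic is formed, the snippet below computes t = (sample mean − μ₀) / (s / √n) with the Python standard library. The function name `one_sample_t`, the null mean `mu0`, and the sample scores are all hypothetical; the snippet does not reproduce the paper's data or the values reported in the caption.

```python
import math
import statistics


def one_sample_t(sample: list[float], mu0: float) -> float:
    """One-sample t statistic for H0: the population mean equals mu0."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("sample has zero variance; t is undefined")
    standard_error = s / math.sqrt(n)
    return (mean - mu0) / standard_error


if __name__ == "__main__":
    # Hypothetical per-example safety scores, not data from the paper.
    scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    t = one_sample_t(scores, mu0=0.0)
    print(f"t = {t:.4f} with {len(scores) - 1} degrees of freedom")
```

The null hypothesis is rejected when the observed t falls beyond the critical thresholds for n − 1 degrees of freedom, which is what the shaded rejection regions in the figure depict.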
