Multi-Reward GRPO Fine-Tuning for De-biasing Large Language Models: A Study Based on Chinese-Context Discrimination Data

Deng Yixuan; Ji Xiaoqiang

TL;DR

LLMs encode culturally specific biases that Western-centric alignment methods struggle to mitigate. This work proposes a Multi-Reward Group Relative Policy Optimization (GRPO) framework in which a DeBERTa-based fairness reward model guides LoRA fine-tuning of LLaMA 3.1 along three dimensions: fairness, relevance, and linguistic quality. A Chinese-context discrimination dataset is constructed and used to train the reward model, enabling multi-dimensional optimization that reduces regional, ethnic, and occupational biases while preserving fluency and informativeness. Experiments show a substantial gain in fairness (0.74 to 0.93), negligible change in fluency, and a modest improvement in relevance, suggesting that GRPO is a practical, scalable approach to culturally aware ethical alignment. The paper also outlines a replicable pipeline and future directions: multilingual settings, dynamic reward scheduling, and expanded ethical taxonomies.

Abstract

Large Language Models (LLMs) often exhibit implicit biases and discriminatory tendencies that reflect underlying social stereotypes. While recent alignment techniques such as RLHF and DPO have mitigated some of these issues, they remain limited in addressing culturally specific and multi-dimensional forms of discrimination. This paper proposes a Multi-Reward Group Relative Policy Optimization (GRPO) framework to fine-tune LLMs toward ethical and bias-free behavior. Our approach constructs a synthetic English-language dataset derived from Chinese-context discrimination categories, including regional, ethnic, and occupational biases. Each instance is paired with both neutral and biased responses to train a reward model based on DeBERTa-v3, which provides multi-dimensional reward signals capturing fairness, neutrality, and linguistic quality. The trained reward model then guides GRPO fine-tuning to optimize model outputs along these ethical dimensions. Experimental results demonstrate significant reductions in bias intensity and improved alignment with non-discriminatory standards without compromising fluency or informativeness. This study highlights the effectiveness of GRPO-based multi-reward optimization for de-biasing LLMs and offers a replicable framework for cultural-contextual ethical alignment.
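
The core idea of multi-reward GRPO can be illustrated with a minimal sketch: several per-dimension scores are merged into one scalar reward per response, and each response is then scored relative to the other responses sampled for the same prompt. The weights, the dimension names used as dictionary keys, the helper names, and the example scores below are all hypothetical and are not taken from the paper; the use of a population standard deviation is likewise an assumption.

```python
from statistics import mean, pstdev

# Hypothetical weights: the paper's actual weighting is not stated here.
WEIGHTS = {"fairness": 0.5, "relevance": 0.25, "quality": 0.25}


def combined_reward(scores, weights=WEIGHTS):
    """Weighted sum of the per-dimension reward scores for one response."""
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is rewarded only for being better than its siblings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up scores for three sampled responses to a single prompt.
group = [
    {"fairness": 0.9, "relevance": 0.8, "quality": 0.8},
    {"fairness": 0.3, "relevance": 0.9, "quality": 0.8},
    {"fairness": 0.6, "relevance": 0.7, "quality": 0.8},
]
rewards = [combined_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the first response receives the largest advantage and the second the smallest, because the fairness term dominates the combined reward under the assumed weights. In an actual run the per-dimension scores would come from the trained reward model and any auxiliary scorers rather than from hand-written numbers.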
