Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach
Xuying Li, Zhuo Li, Yuji Kosuga, Victor Bian
TL;DR
The paper tackles the problem of aligning large language models to human values under multiple, potentially conflicting objectives such as safety and usefulness. It pairs Group Relative Policy Optimization (GRPO) with a learned multi-label reward model that predicts scores along four axes (politeness, meaningfulness, actionability, safety) and combines them into a scalar reward $R(s,a)$ to guide training. The authors provide theoretical justification for using a learned reward in GRPO and report empirical gains on adversarial prompts for 0.5B, 7B, and 14B-parameter models fine-tuned with LoRA, showing improved safety and overall alignment at lower computational overhead than PPO-based RLHF and DPO. Qualitative and human evaluations corroborate the higher safety and politeness, and ablations show the value of maintaining multiple objective axes rather than collapsing to a single reward. The framework offers a scalable, interpretable path toward safe and aligned LLMs, with an open-source release and potential applicability to broader domains and objectives.
Abstract
Aligning large language models (LLMs) with human values and safety constraints is challenging, especially when objectives such as helpfulness, truthfulness, and avoidance of harm conflict. Reinforcement Learning from Human Feedback (RLHF) has achieved notable success in steering models, but it is complex and can be unstable. Recent approaches such as Direct Preference Optimization (DPO) simplify preference-based fine-tuning but may introduce bias or trade off certain objectives~\cite{dpo}. In this work, we propose a Group Relative Policy Optimization (GRPO) framework with a multi-label reward regression model to achieve safe and aligned language generation. The GRPO algorithm optimizes a policy by comparing groups of sampled responses, eliminating the need for a separate value critic and improving training efficiency~\cite{grpo}. We train a reward model to predict multiple alignment scores (e.g., safety and helpfulness), which are combined into a single reward signal. We provide a theoretical derivation for using this learned multi-aspect reward within GRPO and discuss its advantages and limitations. Empirically, our approach improves all evaluated safety and quality metrics on language generation tasks across three model scales (0.5B, 7B, and 14B parameters), demonstrating a robust balance of objectives. We compare GRPO to PPO-based RLHF and DPO, highlighting that GRPO achieves alignment with significantly lower computational cost and explicit multi-objective handling. \textbf{We will open-source all trained models at https://huggingface.co/hydroxai.}
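
To make the two ideas in the abstract concrete, the following minimal Python sketch shows how per-axis reward scores might be collapsed into one scalar and how a group of sampled responses can then be scored relative to each other without a value critic. The equal axis weights, the weighted-sum combination, and the function names `scalar_reward` and `group_relative_advantages` are assumptions made for illustration; the abstract does not specify how the scores are combined, and the example scores are invented.

```python
from statistics import mean, pstdev

# The four axes named in the TL;DR. The equal weights are an assumption:
# the abstract only says the scores are "combined into a single reward signal".
AXES = ("politeness", "meaningfulness", "actionability", "safety")
WEIGHTS = {axis: 0.25 for axis in AXES}


def scalar_reward(scores):
    """Collapse per-axis scores into one scalar (assumed weighted sum)."""
    return sum(WEIGHTS[axis] * scores[axis] for axis in AXES)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    The group statistics play the role of a baseline, so no separate
    value critic is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative scores for three responses sampled for the same prompt.
group = [
    {"politeness": 0.9, "meaningfulness": 0.8, "actionability": 0.7, "safety": 1.0},
    {"politeness": 0.5, "meaningfulness": 0.6, "actionability": 0.4, "safety": 0.2},
    {"politeness": 0.7, "meaningfulness": 0.7, "actionability": 0.6, "safety": 0.8},
]
rewards = [scalar_reward(scores) for scores in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below it receive a negative advantage. In this sketch the unsafe second response ends up with the lowest advantage of the three.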
