Exploring Safety-Utility Trade-Offs in Personalized Language Models
Anvesh Rao Vijjini, Somnath Basu Roy Chowdhury, Snigdha Chaturvedi
TL;DR
The paper examines personalization bias (PB) in LLMs, showing that revealing or inferring a user's identity can shift both the safety and the utility of responses. It introduces a formal PB metric, PB(\mathcal{U}) = \sqrt{ \mathbb{E}_{u \sim \mathcal{U}} [ \| f(u) - \mu(\mathcal{U}) \|^2 ] } with \mu(\mathcal{U}) = \mathbb{E}_{u \sim \mathcal{U}} [ f(u) ], where f(u) is the model's measured performance under identity u and \mathcal{U} is the set of identities. The metric is therefore the root-mean-square deviation of per-identity performance from the average over identities. The paper evaluates 31 identities across multiple models on utility tasks (MMLU, GSM8K, MBPP) and safety probes (DNA, StrongReject). Results show substantial identity-dependent variation in performance and in safety-utility trade-offs across both open-source and API-based LLMs, and intersectional identities yield outcomes distinct from those of their component identities. The authors propose two mitigation strategies, preference tuning via DPO and prompt-based defenses, which reduce PB but do not eliminate it, so equitable deployment remains an open challenge. The work has practical implications for deploying personalized LLMs and motivates future research on robust, identity-aware safety and utility guarantees.
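The PB formula above can be illustrated with a minimal sketch. It assumes a uniform distribution over identities and treats f(u) as a small vector of scores per identity (here a hypothetical utility score and safety score); the function name `personalization_bias`, the identity labels, and the numbers are made up for illustration and are not taken from the paper.

```python
import math


def personalization_bias(scores):
    """Root-mean-square deviation of per-identity scores from their mean.

    scores: dict mapping an identity label to a tuple f(u) of metric values.
    Assumes every identity is weighted equally (uniform expectation).
    """
    vectors = list(scores.values())
    n = len(vectors)
    dim = len(vectors[0])

    # mu(U): component-wise mean of f(u) over all identities
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]

    # ||f(u) - mu(U)||^2 for each identity
    sq_dists = [
        sum((v[i] - mean[i]) ** 2 for i in range(dim)) for v in vectors
    ]

    # PB(U) = sqrt(E_u[ ||f(u) - mu(U)||^2 ])
    return math.sqrt(sum(sq_dists) / n)


# Hypothetical (utility, safety) scores for three placeholder identities.
example = {
    "identity_a": (0.70, 0.90),
    "identity_b": (0.60, 0.80),
    "identity_c": (0.65, 0.85),
}
pb = personalization_bias(example)

# A model that scores identically for every identity has zero PB.
pb_uniform = personalization_bias({"x": (0.5, 0.5), "y": (0.5, 0.5)})
```

Under these assumptions, a PB of zero means performance does not depend on the identity at all, and larger values mean per-identity scores are spread further from the average.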
Abstract
As large language models (LLMs) become increasingly integrated into daily applications, it is essential to ensure they operate fairly across diverse user demographics. In this work, we show that LLMs suffer from personalization bias, where their performance is impacted when they are personalized to a user's identity. We quantify personalization bias by evaluating the performance of LLMs along two axes: safety and utility. We measure safety by examining how benign LLM responses are to unsafe prompts with and without personalization. We measure utility by evaluating the LLM's performance on various tasks, including general knowledge, mathematical abilities, programming, and reasoning skills. We find that various LLMs, ranging from open-source models like Llama (Touvron et al., 2023) and Mistral (Jiang et al., 2023) to API-based ones like GPT-3.5 and GPT-4o (Ouyang et al., 2022), exhibit significant variance in performance in terms of safety-utility trade-offs depending on the user's identity. Finally, we discuss several strategies to mitigate personalization bias using preference tuning and prompt-based defenses.
