Exploring Safety-Utility Trade-Offs in Personalized Language Models

Anvesh Rao Vijjini, Somnath Basu Roy Chowdhury, Snigdha Chaturvedi

TL;DR

The paper examines personalization bias (PB) in LLMs, showing that revealing or inferring a user's identity can shift the safety and utility of responses. It introduces a formal PB metric, PB(\mathcal{U}) = \sqrt{ \mathbb{E}_{u \sim \mathcal{U}} [ \| f(u) - \mu(\mathcal{U}) \|^2 ] } with \mu(\mathcal{U}) = \mathbb{E}_{u \sim \mathcal{U}} [ f(u) ], that is, the root-mean-square deviation of f(u) from its mean over the identities in \mathcal{U}. The paper evaluates 31 identities across multiple models on utility tasks (MMLU, GSM8K, MBPP) and safety probes (DNA, StrongReject). Results show substantial identity-dependent variation in performance and in safety-utility trade-offs across both open-source and API-based LLMs; intersectional identities yield outcomes distinct from those of their individual components. The authors propose two mitigation strategies, preference tuning via DPO and prompt-based defenses, which reduce PB but do not eliminate it, so equitable deployment remains an open challenge. The work has practical implications for deploying personalized LLMs and motivates future research on robust, identity-aware safety and utility guarantees.
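
The PB metric above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that f(u) is a small vector of scores per identity (here a hypothetical (safety, utility) pair), that identities are weighted uniformly, and that the norm is Euclidean. The function name `personalization_bias`, the identity labels, and all numbers are made up for illustration.

```python
import math


def personalization_bias(scores):
    """Root-mean-square deviation of per-identity score vectors from their mean.

    `scores` maps an identity label to a tuple of floats, e.g. (safety, utility).
    Assumption: identities are weighted uniformly when taking expectations.
    """
    vectors = list(scores.values())
    n = len(vectors)
    dim = len(vectors[0])

    # mu(U): component-wise mean of f(u) over all identities
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]

    # ||f(u) - mu(U)||^2 for each identity (squared Euclidean distance)
    squared_dists = [
        sum((v[d] - mean[d]) ** 2 for d in range(dim)) for v in vectors
    ]

    # PB(U) = sqrt(E_u[ ||f(u) - mu(U)||^2 ])
    return math.sqrt(sum(squared_dists) / n)


# Hypothetical (safety, utility) scores; not taken from the paper.
example_scores = {
    "identity_a": (0.9, 0.6),
    "identity_b": (0.7, 0.8),
}
pb = personalization_bias(example_scores)
```

Under these assumptions, PB is zero exactly when every identity receives the same score vector, and it grows as the per-identity scores spread out around their mean.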

Abstract

As large language models (LLMs) become increasingly integrated into daily applications, it is essential to ensure they operate fairly across diverse user demographics. In this work, we show that LLMs suffer from personalization bias, where their performance is impacted when they are personalized to a user's identity. We quantify personalization bias by evaluating the performance of LLMs along two axes: safety and utility. We measure safety by examining how benign LLM responses are to unsafe prompts with and without personalization. We measure utility by evaluating the LLM's performance on various tasks, including general knowledge, mathematical abilities, programming, and reasoning skills. We find that various LLMs, ranging from open-source models like Llama (Touvron et al., 2023) and Mistral (Jiang et al., 2023) to API-based ones like GPT-3.5 and GPT-4o (Ouyang et al., 2022), exhibit significant variance in performance in terms of safety-utility trade-offs depending on the user's identity. Finally, we discuss several strategies to mitigate personalization bias using preference tuning and prompt-based defenses.

Paper Structure (29 sections, 2 equations, 25 figures, 10 tables)

This paper contains 29 sections, 2 equations, 25 figures, 10 tables.

Figures (25)

  • Figure 1: An example of personalization bias is shown, where the LLM generates undesirable reasoning and fails to provide the correct answer after personalizing for a Muslim user. This example demonstrates the impact of personalization on the LLM response quality, highlighting the emergence of personalization bias.
  • Figure 2: Utility Bias: Performance of GPT-3.5 when personalized with different user identities on MMLU and GSM8K datasets. The horizontal dotted line (- -) shows model performance without any user identity. For both datasets, we observe that performance varies significantly with different user identities, highlighting utility bias introduced by personalization.
  • Figure 3: Safety Bias: Performance of GPT-3.5 when personalized with different user identities on DNA and StrongReject datasets. For both datasets, we observe that the safety scores vary significantly with different user identities, highlighting safety bias introduced by personalization.
  • Figure 4: Safety-utility plots for open-source LLMs (top row: Llama-2 (70B), Llama-3.1 (70B), Mixtral 8x7B) and closed-source LLMs (bottom row: GPT-3.5 and GPT-4o). We report performance on the DNA and MMLU datasets to measure safety and utility, respectively. We observe that adding different user identities impacts both the utility and safety of the LLM responses. The dotted lines (- -) indicate the scores when no user identity is provided.
  • Figure 5: Safety-utility plots for four intersectional user identities on GPT-3.5. We observe that the performance using intersectional user identities can differ significantly from that of their individual components.
  • ...and 20 more figures