Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences

Mingqian Zheng, Wenjia Hu, Patrick Zhao, Motahhare Eslami, Jena D. Hwang, Faeze Brahman, Carolyn Rose, Maarten Sap

TL;DR

This study examines how LLM guardrails shape user experience by disentangling the effects of user motivation from the manner in which refusals are delivered. By introducing a five-strategy refusal taxonomy and the QueryShift probe dataset, the authors show that alignment with user expectations drives perceived safety and satisfaction far more than whether the user is benign or malicious. A key finding is that partial compliance, meaning general information without actionable details, consistently yields the best user experience, reducing negative perceptions by more than half compared with direct refusals. The work also reveals a misalignment between user preferences, LLM refusal behaviors, and reward-model training, underscoring the need to reorient guardrail design toward thoughtful refusals that preserve engagement while maintaining safety. Overall, the paper advocates a human-centered shift in AI safety focus from intent detection to crafting contextually appropriate refusals to improve safety and user trust.

Abstract

Current LLMs are trained to refuse potentially harmful input queries regardless of whether users actually have harmful intent, creating a tradeoff between safety and user experience. Through a study of 480 participants evaluating 3,840 query-response pairs, we examine how different refusal strategies affect user perceptions across varying motivations. Our findings reveal that response strategy largely shapes user experience, while actual user motivation has negligible impact. Partial compliance -- providing general information without actionable details -- emerges as the optimal strategy, reducing negative user perceptions by over 50% compared to flat-out refusals. Complementing this, we analyze response patterns of 9 state-of-the-art LLMs and evaluate how 6 reward models score different refusal strategies, demonstrating that models rarely deploy partial compliance naturally and reward models currently undervalue it. This work demonstrates that effective guardrails require focusing on crafting thoughtful refusals rather than detecting intent, offering a path toward AI safety mechanisms that ensure both safety and sustained user engagement.

Paper Structure

This paper contains 50 sections, 22 figures, 14 tables.

Figures (22)

  • Figure 1: We investigate the contextual effects of LLM guardrails on user experience: how different response strategies (left) affect user perceptions (right) when users have either benign or malicious motivations (center). Our taxonomy includes five response strategies: direct refusal, explanation-based refusal, redirection, partial compliance, and full compliance. We measure perceptions across three dimensions: perceived model behavior, ethical judgments, and affective responses.
  • Figure 2: Example user study flow for the chatbot interaction corresponding to safety category Offensive Language (top left). Participants select topics from a given list (middle) and read the given motivation (benign or malicious). The model's response strategy is determined by the experimental condition: in aligned settings, benign queries receive full compliance while malicious queries receive the assigned refusal strategy; in misaligned settings, this pattern is reversed. Participants immediately evaluate each response across multiple perception dimensions (right).
  • Figure 3: Effect sizes ($\eta^2$) of predictors on user perceptions. Each bar represents the proportion of variance explained by one predictor for a given perception variable. Alignment consistently shows the strongest effect across all perceptions, while Compliance contributes moderately to positive perceptions like Helpfulness and Satisfaction.
  • Figure 4: OLS regression coefficients showing the effect of each refusal strategy on user perceptions relative to full compliance. All refusal strategies lead to significantly negative user reactions, with partial compliance (Part) being the most favorable. Error bars represent 95% confidence intervals. Significance levels: .$p<.1$, *$p<.05$, **$p<.01$, ***$p<.001$.
  • Figure 5: Distribution of response strategies across QueryShift and CASE-Bench under three settings: query only (no motivation), query with benign motivation, and query with malicious motivation.
  • ...and 17 more figures
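
The condition logic described in the Figure 2 caption can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' study code: the `Strategy` enum and the `response_strategy` function are hypothetical names, and the only rule encoded is the one the caption states (aligned: benign gets full compliance and malicious gets the assigned refusal strategy; misaligned: the reverse). Treating partial compliance as one of the assignable refusal strategies follows the Figure 4 caption.

```python
from enum import Enum


class Strategy(Enum):
    """The five response strategies named in the Figure 1 caption."""
    DIRECT_REFUSAL = "direct refusal"
    EXPLANATION_BASED_REFUSAL = "explanation-based refusal"
    REDIRECTION = "redirection"
    PARTIAL_COMPLIANCE = "partial compliance"
    FULL_COMPLIANCE = "full compliance"


def response_strategy(motivation: str, aligned: bool,
                      assigned_refusal: Strategy) -> Strategy:
    """Return the strategy a participant sees (hypothetical helper).

    motivation: "benign" or "malicious", as shown to the participant.
    aligned: True for the aligned setting, False for the misaligned one.
    assigned_refusal: the non-full-compliance strategy for this condition.
    """
    if motivation not in ("benign", "malicious"):
        raise ValueError(f"unknown motivation: {motivation!r}")
    if assigned_refusal is Strategy.FULL_COMPLIANCE:
        raise ValueError("assigned_refusal must not be full compliance")

    is_benign = motivation == "benign"
    # Aligned: benign -> full compliance, malicious -> assigned refusal.
    # Misaligned: the pattern is reversed.
    if is_benign == aligned:
        return Strategy.FULL_COMPLIANCE
    return assigned_refusal
```

For example, `response_strategy("benign", aligned=False, assigned_refusal=Strategy.REDIRECTION)` returns `Strategy.REDIRECTION`, the misaligned case in which a benign query receives a refusal-style response.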