Comparing human and LLM politeness strategies in free production

Haoran Zhao, Robert D. Hawkins

TL;DR

This work interrogates whether large language models can reproduce human politeness strategies and adapt them to context. By combining constrained, multiple-choice tasks with open-ended generation, it demonstrates that models with $\ge$70B parameters exhibit substantial pragmatic competence and can even be preferred by human evaluators when generating open-ended responses. However, linguistic analyses reveal systematic differences: LLMs overuse negative politeness (minimizing imposition) across contexts, potentially causing pragmatic misinterpretations and alignment concerns. The findings illuminate how training objectives shape subtle relational patterns in AI, with implications for safer, more interpretable human–AI interactions and for refining pragmatics-aligned instruction and evaluation. Overall, the work provides a nuanced view of pragmatic alignment in AI systems, balancing evidence of human-like competence with caution about underlying strategy biases.

Abstract

Polite speech poses a fundamental alignment challenge for large language models (LLMs). Humans deploy a rich repertoire of linguistic strategies to balance informational and social goals -- from positive approaches that build rapport (compliments, expressions of interest) to negative strategies that minimize imposition (hedging, indirectness). We investigate whether LLMs employ a similarly context-sensitive repertoire by comparing human and LLM responses in both constrained and open-ended production tasks. We find that larger models ($\ge$70B parameters) successfully replicate key preferences from the computational pragmatics literature, and human evaluators surprisingly prefer LLM-generated responses in open-ended contexts. However, further linguistic analyses reveal that models disproportionately rely on negative politeness strategies even in positive contexts, potentially leading to misinterpretations. While modern LLMs demonstrate an impressive handle on politeness strategies, these subtle differences raise important questions about pragmatic alignment in AI systems.

Paper Structure

This paper contains 43 sections, 5 figures, 12 tables.

Figures (5)

  • Figure 1: (A) Correlations between human and model response probabilities for the top 4 models with specific prompting strategies we tested. Both the base and instruct-tuned versions of Qwen2.5-72B are shown here for comparison. Error bars are 95% confidence intervals across vignettes. (B) Comparing the pattern of human and LLM responses across different communicative goals and ratings. Model results are from Llama-3.3-70B-Instruct using the multi-choice-persona prompting strategy; human responses are from yoon20.
  • Figure 2: Human evaluation results. The bars show the relative preference (50% is chance). Bars above the 50% line indicate the percentage to which responses are preferred as expected, and below indicate the percentage to which responses are preferred as unexpected. (A) Evaluators systematically prefer LLM generations over human generations. (B) Both humans and LLMs are sensitive to goals and (C) ratings. Error bars are bootstrapped 95% confidence intervals.
  • Figure 3: Proportion of different politeness strategies across ratings and goals for (A) humans and (B) LLMs.
  • Figure 4: Pipeline for comparing open-ended polite speech generation in humans and LLMs. Our study consists of two stages: an initial stage where we elicit responses for a variety of scenarios and a second stage where we ask a naive group to evaluate which of these responses they prefer.
  • Figure 5: (A) Distributions of how often the "was/wasn't terrible/bad/good/amazing" template studied by yoon20 was spontaneously produced by participants under each goal and rating. (B) How often responses use negation as a strategy among the responses that apply the yoon20 format under each goal and rating.