How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?

Ryan Liu; Theodore R. Sumers; Ishita Dasgupta; Thomas L. Griffiths

TL;DR

This work formalizes and empirically probes how large language models balance honesty and helpfulness in conversation by grounding the analysis in Gricean maxims and Rational Speech Acts, and by employing a signaling-bandits paradigm to quantify trade-offs. It systematically compares training and prompting strategies, notably RLHF and Chain-of-Thought prompting, across multiple models and realistic scenarios. Key findings show RLHF consistently boosts both honesty and helpfulness, while Chain-of-Thought prompting tends to increase helpfulness at potential costs to honesty, with GPT-4 Turbo exhibiting human-like, frame-sensitive steerability. The results illuminate the internalized conversational values of LLMs and demonstrate that these abstract preferences can be steered by prompting, shaping practical guidance for deploying safe and aligned conversational agents.

Abstract

In day-to-day communication, people often approximate the truth - for example, rounding the time or omitting details - in order to be maximally helpful to the listener. How do large language models (LLMs) handle such nuanced trade-offs? To address this question, we use psychological models and experiments designed to characterize human behavior to analyze LLMs. We test a range of LLMs and explore how optimization for human preferences or inference-time reasoning affects these trade-offs. We find that reinforcement learning from human feedback improves both honesty and helpfulness, while chain-of-thought prompting skews LLMs towards helpfulness over honesty. Finally, GPT-4 Turbo demonstrates human-like response patterns including sensitivity to the conversational framing and listener's decision context. Our findings reveal the conversational values internalized by LLMs and suggest that even these abstract values can, to a degree, be steered by zero-shot prompting.

Paper Structure (24 sections, 9 equations, 5 figures, 4 tables)

Introduction
Background
Training and prompting LLMs
Measuring alignment to human values
Formalizing Helpfulness and Honesty
Experimental Setup
Experiment 1: Choosing an utterance
Experiment 2: To say or not to say
Experiment 3: Everyday settings
Discussion
Prompts
Experiment 1
Experiment 2
Experiment 3
Housing
...and 9 more sections

Figures (5)

Figure 1: We test LLMs in the signaling bandits paradigm, an extension of classic Lewis reference games (Lewis, 1969) to contextual bandit settings. A: The world state $w$ is a reward vector over mushroom features. B: Speakers know the reward function and produce utterances about feature values to inform the listener's decision-making. C: Utterances are truthful if they reflect the actual value of a feature, and helpful if they improve the expected utility of the listener's policy $\pi_L$. Not all true utterances are helpful, and vice versa. "Spots are +1" is both true and helpful; "Spots are +2" is false but helpful. "Red is 0" and "Green is +2" are both true but not helpful. Figure reproduced with permission from Sumers et al. (2023).
Figure 2: Experiment 1 results. A: RLHF yields strictly better performance on both honesty and helpfulness; larger models (GPT-4) recognize and satisfy both. B: CoT increases helpfulness in smaller models, often at the cost of honesty, whereas GPT-4 and GPT-4 Turbo are always truthful and generally helpful. C: Results from fitting the psychological model. The y-axis shows the posterior mean of the $\lambda$ parameter: CoT causes LLMs to weight helpfulness substantially more.
Figure 3: Results from fitting the psychological model in Experiment 2. The endorsement paradigm reveals how models trade off honesty and helpfulness. We additionally explore the steerability of models by biasing their prompts towards honesty or helpfulness. The y-axis shows the posterior mean over the $\lambda$ parameter. This reveals a strong preference towards helpfulness in LLaMA; intriguingly, the base Mixtral model strongly prefers helpfulness but the post-RLHF model prefers honesty. We find that GPT-4 and GPT-4 Turbo with chain-of-thought are, in fact, steerable: while they heavily favor honesty given a neutral prompt, their objective weighting is responsive to helpfulness prompts in a human-like manner. Smaller models are not steerable.
Figure 4: Visualizing the helpfulness-honesty trade-off in GPT-4 Turbo. Row A (top) shows human participants from Sumers et al. (2023), Row B (middle) shows GPT-4 Turbo in the mushroom setting (Exp 2), and Row C (bottom) shows GPT-4 Turbo in the housing setting (Exp 3). Left: neutral prompt. Center: prompted to focus on honesty only. Right: prompted to focus on helpfulness only. Dots show individual trial responses and lines show nonparametric regressions. With a neutral prompt, GPT-4 Turbo refuses to endorse false utterances regardless of their helpfulness (Rows B and C, left column, blue lines). However, it is highly sensitive to the helpfulness of true utterances, endorsing them only when they are helpful (Rows B and C, left column, orange lines). Prompted for honesty only, it successfully adjusts in the Housing setting (Row C, center) but not in the Mushrooms setting (Row B, center). Finally, prompted for helpfulness only, it is willing to endorse false utterances as their helpfulness increases (Rows B and C, right column), resembling humans' default response (Row A, left column).
Figure 5: Results from fitting the psychological model to LLMs in Experiment 3, using more realistic cover stories (compare to Fig. 3). The y-axis shows the posterior mean of the $\lambda$ parameter. GPT-4 Turbo, but not GPT-3.5 or GPT-4, displays similar steerability.
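
The truthful/helpful distinction in the Figure 1 caption can be made concrete with a small sketch. Everything below is an illustrative assumption rather than the paper's implementation: the feature rewards other than spots (+1), red (0), and green (+2), the three-mushroom context, the listener's all-zero prior, and the greedy listener with uniform tie-breaking are hypothetical, as is the convex combination used to show how a weight like $\lambda$ might trade helpfulness against honesty.

```python
# Hypothetical world state w: true reward of each mushroom feature.
# Only spotted (+1), red (0), and green (+2) come from the Figure 1 caption.
TRUE_REWARDS = {"green": 2, "red": 0, "blue": -2,
                "spotted": 1, "solid": 0, "striped": -1}

# Hypothetical context: the mushrooms (color, texture) the listener chooses among.
CONTEXT = [("red", "spotted"), ("red", "solid"), ("red", "striped")]


def mushroom_value(mushroom, rewards):
    """Sum the rewards of a mushroom's features; unknown features count as 0."""
    return sum(rewards.get(feature, 0) for feature in mushroom)


def listener_utility(beliefs):
    """True expected reward of a listener who acts greedily on its beliefs.

    Assumption: the listener picks uniformly among the mushrooms it
    believes are best, so with no information it chooses at random.
    """
    believed = [mushroom_value(m, beliefs) for m in CONTEXT]
    best = max(believed)
    chosen = [m for m, v in zip(CONTEXT, believed) if v == best]
    return sum(mushroom_value(m, TRUE_REWARDS) for m in chosen) / len(chosen)


def evaluate(feature, value):
    """Return (truthful, helpfulness gain) for the utterance '<feature> is <value>'."""
    truthful = TRUE_REWARDS[feature] == value
    baseline = listener_utility({})                # listener hears nothing
    informed = listener_utility({feature: value})  # listener hears the utterance
    return truthful, informed - baseline


def speaker_utility(feature, value, lam):
    """Assumed convex combination: lam weights helpfulness, (1 - lam) honesty."""
    truthful, gain = evaluate(feature, value)
    return lam * gain + (1 - lam) * (1.0 if truthful else 0.0)


if __name__ == "__main__":
    for feature, value in [("spotted", 1), ("spotted", 2), ("red", 0), ("green", 2)]:
        print(feature, value, evaluate(feature, value))
```

In this toy context, "Spots are +1" and "Spots are +2" steer the listener to the same mushroom, so both are equally helpful even though only the first is true. "Red is 0" and "Green is +2" are true but leave the listener's choice unchanged, mirroring the four cases in the caption.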
