People readily follow personal advice from AI but it does not improve their well-being

Lennart Luettgau, Vanessa Cheung, Magda Dubois, Keno Juechems, Jessica Bergs, Henry Davidson, Bessie O'Dell, Hannah Rose Kirk, Max Rollwage, Christopher Summerfield

TL;DR

The paper investigates whether people follow personal advice from consumer LLMs and whether such guidance affects their well-being. In a longitudinal RCT with a large, representative UK sample (N = 2,302) and a 2x2x2 factorial design plus a control, participants held a 20-minute conversation with GPT-4o about health, careers or relationships, with a follow-up 2–3 weeks later. The study uses an LLM-based harm autograder and Bayesian GLMs to examine advice content, determinants of adherence, and psychological outcomes. It finds high adherence but no lasting well-being benefit; personalisation increases the likelihood of following advice, and safety safeguards yield very low harm rates. These findings imply that while AI advice can influence real-world decisions, it offers limited sustained psychological value, underscoring the need for governance and safety considerations as LLMs become common personal advisors.

Abstract

People increasingly seek personal advice from large language models (LLMs), yet whether humans follow their advice, and its consequences for their well-being, remains unknown. In a longitudinal randomised controlled trial with a representative UK sample (N = 2,302), 75% of participants who had a 20-minute discussion with GPT-4o about health, careers or relationships subsequently reported following its advice. Based on autograder evaluations of chat transcripts, LLM advice rarely violated safety best practice. When queried 2-3 weeks later, participants who had interacted with personalised AI (with access to detailed user information) followed its advice more often in the real world and reported higher well-being than those advised by non-personalised AI. However, while receiving personal advice from AI temporarily reduced well-being, no differential long-term effects compared to a control emerged. Our results suggest that humans readily follow LLM advice about personal issues but doing so shows no additional well-being benefit over casual conversations.

Paper Structure

This paper contains 2 sections, 23 equations, 12 figures, 1 table.

Table of Contents

  1. Introduction
  2. Results

Figures (12)

  • Figure 1: A. Schematic of the experimental design and study procedure in both Session 1 and Session 2, including details of the randomisation and tests administered. B. Example pathway-specific questions (administered in Session 1). C. Advice density among chatbot utterances (LLM classification) across control and experimental conditions. D. Advice density across experimental conditions (Safety, Actionability, Personal Information). Large black dots show means, error bars are 95% confidence intervals, small grey dots are individual participant datapoints.
  • Figure 2: A. Self-reported advice received (Session 1, immediately after the conversation) and advice followed (Session 2) across control (brown dots) and experimental conditions (green dots); large dots show means, error bars are 95% confidence intervals. B. Percentage of advice-following across experimental conditions (Safety, Actionability, Personal Information). C. Bayesian GLM posterior parameter estimates for the main effects of experimental conditions and their interaction on advice-following at Session 2; dots are posterior means, error bars represent 95% HPDI, coloured dots and error bars denote effects that are non-zero (HPDI does not contain 0). D. Advice density, separately for participants who followed the chatbot advice vs. did not follow the advice; large black dots show means, error bars are 95% confidence intervals, small grey dots are individual participant datapoints. E. Bayesian GLM posterior parameter estimates for effects of advice density on advice-following at Session 2, separately for the experimental conditions; dots are posterior means, error bars represent 95% HPDI.
  • Figure 3: A. Self-reported advice-following (dark blue) and advice received (light blue) counts, categorised by themes derived from LLM-based content analysis. B. Self-reported advice-following percentage across levels of problem severity derived from PCA scores combining self- and LLM autograder-assessed problem severity. Note that these analyses only include participants in the experimental group, as control group participants were not asked to assess problem severity of the topic discussed with the AI (hobbies/interests). C. Self-reported advice-following percentage across different "stakes" of the advice (derived from PCA scores, computed from LLM autograder-assessed reversibility and consequentiality of the advice and how much time it would take to implement the advice), separately for the experimental (green dots) and control group (brown dots). Note that for visualisation, we restricted the stakes plot to bins that included more than 5 participants to ensure reliable mean/CI estimation (this affected 7 participants in total, all in the "very high" bin). In both B and C, dots show means, error bars are 95% confidence intervals. D. Correlation of problem severity and stakes (overall Pearson correlation: $r = .18$, $p < .001$), similar in participants who followed (green) and those who did not follow (red) the AI advice ($r = .20$, $p < .001$ vs $r = .12$, $p = .009$; Fisher's $z = 1.51$, $p = .130$). E. Bayesian GLM posterior parameter estimates for the effects of sociodemographic variables, problem severity and stakes on advice-following, extracted from the best-fitting GLM (including quadratic terms for problem severity); dots are posterior means, error bars represent 95% HPDIs. F. Bayesian GLM posterior parameter estimates for effects of several participant-rated advice qualities (Session 1), sycophancy and user engagement on advice-following. GLMs were computed only using data from the experimental group.
  • Figure 4: A. Average subjective advice value (Session 2) across control (brown dots) and experimental conditions (green dots), separately for participants who followed vs did not follow the advice; large dots show means, error bars are 95% confidence intervals. B. Average subjective advice value (Session 2) across experimental conditions (Safety, Actionability, Personal Information). C. Bayesian GLM posterior parameter estimates for the effects of experimental conditions, advice-following and their interaction on subjective advice value (dots are posterior means, error bars represent 95% HPDI); coloured dots and error bars denote effects that are non-zero (HPDI does not contain 0).
  • Figure 5: A. Well-being factor scores over timepoints across experimental and control conditions, separately for participants who followed and did not follow AI advice (well-being factor scores from factor analysis based on PHQ, GAD, SSS, JSS, WHO-5, ONS, SWBS, JAWS, PANAS, Affect grid arousal and valence; see the supplementary figure on the factor analysis (EFA) results; Session 1 POST: Session 1 POST minus Session 1 PRE; Session 2: Session 2 minus Session 1 PRE). B. Short- and long-term well-being changes across experimental conditions. C. Clinical threshold (score $\geq 3$) crossing transitions for PHQ-2 (depression) from Session 1 PRE to Session 2. Participants are categorised according to four state transitions: stayed well (remained below clinical threshold), improved (crossed below threshold), deteriorated (crossed above threshold), and stayed unwell (remained above threshold). Distribution shown separately for control and experimental conditions. D. Best-fitting Bayesian GLM posterior parameter estimates for well-being changes. E. Best-fitting Bayesian GLM posterior parameter estimates for well-being changes including the data of all experimental conditions (safety, advice, personal information) shown in B. Dots are posterior means, error bars represent 95% Highest Posterior Density Intervals (HPDI), coloured dots and error bars denote effects that are non-zero (HPDI does not contain 0). F. Clinical threshold (score $\geq 3$) crossing transitions for GAD-2 (anxiety) from Session 1 PRE to Session 2.
  • ...and 7 more figures
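
The four clinical-threshold transitions named in the Figure 5 caption (panels C and F) can be sketched as a small classifier. This is only an illustrative sketch, assuming integer PHQ-2 or GAD-2 scores and the caption's threshold of 3, where a score at or above 3 counts as above threshold. The function names, variable names and example scores are hypothetical and are not taken from the paper or its analysis code.

```python
from collections import Counter

# Threshold from the Figure 5 caption: a score >= 3 is treated as above threshold.
CLINICAL_THRESHOLD = 3


def classify_transition(pre_score: int, post_score: int,
                        threshold: int = CLINICAL_THRESHOLD) -> str:
    """Label one participant's change from Session 1 PRE to Session 2."""
    was_above = pre_score >= threshold
    is_above = post_score >= threshold
    if not was_above and not is_above:
        return "stayed well"      # remained below threshold
    if was_above and not is_above:
        return "improved"         # crossed below threshold
    if not was_above and is_above:
        return "deteriorated"     # crossed above threshold
    return "stayed unwell"        # remained above threshold


def transition_counts(score_pairs):
    """Count transitions over an iterable of (pre_score, post_score) pairs."""
    return Counter(classify_transition(pre, post) for pre, post in score_pairs)


# Made-up (pre, post) scores for illustration only; not data from the study.
example_pairs = [(1, 0), (4, 2), (2, 5), (3, 6), (0, 3)]
counts = transition_counts(example_pairs)
```

Computing such counts separately for the control and experimental groups would give the kind of distribution the caption describes, although the paper's own procedure may differ in detail.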