When AI Gives Advice: Evaluating AI and Human Responses to Online Advice-Seeking for Well-Being

Harsh Kumar, Jasmine Chahal, Yinuo Zhao, Zeling Zhang, Annika Wei, Louis Tay, Ashton Anderson

TL;DR

The paper examines whether frontier LLMs give higher-quality everyday well-being advice than crowdsourced Reddit replies, and explores how lightweight human–AI collaboration can improve advice. Two pre-registered studies compare GPT-4o and GPT-5 against top Reddit comments and against augmentation pipelines, and an exploratory survey of undergraduates surfaces user preferences for coach-like versus friend-like AI personas. Frontier LLMs generally outperform the crowd on single-shot advice, but GPT-4o often leads GPT-5, so benchmark gains do not automatically translate into better advice; user preferences also vary by persona and trust. The work further shows that LLM augmentation and expert feedback can meaningfully shape both human and AI-generated advice, to the point that polished human advice competes with AI-generated comments, and it points to design patterns for hybrid ecosystems that balance quality, transparency, and safety. These results inform the design of advice agents and ecosystems that blend AI, crowds, and expert oversight.

Abstract

Seeking advice is a core human behavior that the Internet has reinvented twice: first through forums and Q&A communities that crowdsource public guidance, and now through large language models (LLMs) that deliver private, on-demand counsel at scale. Yet the quality of this synthesized LLM advice remains unclear. How does it compare, not only against arbitrary human comments, but against the wisdom of the online crowd? We conducted two studies (N = 210) in which experts compared top-voted Reddit advice with LLM-generated advice. LLMs ranked significantly higher overall and on effectiveness, warmth, and willingness to seek advice again. GPT-4o beat GPT-5 on all metrics except sycophancy, suggesting that benchmark gains need not improve advice-giving. In our second study, we examined how human and algorithmic advice could be combined, and found that human advice can be unobtrusively polished to compete with AI-generated comments. Finally, to surface user expectations, we ran an exploratory survey with undergraduates (N = 148) that revealed heterogeneous, persona-dependent preferences for agent qualities (e.g., coach-like: goal-focused structure; friend-like: warmth and humor). We conclude with design implications for advice-giving agents and ecosystems blending AI, crowd input, and expert oversight.

Paper Structure

This paper contains 47 sections, 11 figures, 1 table.

Figures (11)

  • Figure 1: Overview of data collection and evaluation pipeline for Study-1. We begin by sourcing popular advice-seeking posts and top-rated comments from r/getdisciplined, then generate matching LLM responses. Expert annotators subsequently compare human and AI advice through Likert-scale ratings and ranking-based judgments.
  • Figure 2: Study interface. Left: original Reddit-style post shown before rating. Middle: example LLM advice (GPT-4o) with the six Likert items and a one-line improvement prompt. Right: example best-human advice (Reddit-Top) with the same items. Each participant completed two scenarios; for each scenario they rated four anonymized responses (source labels hidden in the study; labels added here for exposition) and then completed the two rankings (overall effectiveness; long-run impact).
  • Figure 3: Expert ratings by comment source (Mean $\pm$ SEM) for Study-1. Bars show mean Likert scores (1–7) across participants for each evaluation dimension, grouped by source: Reddit-Top, Reddit-90th, GPT-4o, and GPT-5 (humans shown in grayscale, AIs in color). Higher values indicate stronger endorsement on that dimension; for the sycophancy item, higher values indicate a greater emphasis on pleasing the recipient over offering objective guidance.
  • Figure 4: Preference and AI detection by comment source for Study-1. (a) Probability of superiority (PS): Left facet shows overall best; right facet shows long-term benefit. For each ranking instruction, PS is the probability that a response from a given source is ranked above a response from a competing source in pairwise comparisons within the same scenario. For each participant and source, PS is computed as wins/(wins+losses) from the 1–4 rank order, then averaged across participants; error bars show SEM. (b) Perceived as AI-generated: For each participant and source, the AI detection rate is the proportion of the two exposures in which the response was flagged as AI-generated; bars show the mean across participants with SEM.
  • Figure 5: Human-LLM collaboration pipeline for Study-2. Augmenting original responses grouped by seed source: Reddit-Top and GPT-4o. Intervention 1 only augments with an LLM, while intervention 2 augments with an LLM guided by expert feedback from Study-1.
  • ...and 6 more figures
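
The probability-of-superiority and AI-detection measures described in the Figure 4 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names, the data layout (one dict of ranks per scenario), and the toy rankings are all hypothetical, ranks are assumed strict (no ties), and SEM is assumed to be the sample standard deviation divided by the square root of the number of participants.

```python
from math import sqrt
from statistics import mean, stdev

# Source labels as used in the figure captions.
SOURCES = ("Reddit-Top", "Reddit-90th", "GPT-4o", "GPT-5")


def probability_of_superiority(scenarios):
    """Per-source PS = wins / (wins + losses) for ONE participant.

    `scenarios` is a list of dicts mapping source -> rank (1 = best, 4 = worst),
    one dict per scenario. Hypothetical layout; strict ranks are assumed.
    """
    wins = dict.fromkeys(SOURCES, 0)
    losses = dict.fromkeys(SOURCES, 0)
    for ranks in scenarios:
        for a in SOURCES:
            for b in SOURCES:
                if a == b:
                    continue
                # Pairwise comparison within the same scenario.
                if ranks[a] < ranks[b]:
                    wins[a] += 1
                elif ranks[a] > ranks[b]:
                    losses[a] += 1
    return {s: wins[s] / (wins[s] + losses[s]) for s in SOURCES}


def ai_detection_rate(flags):
    """Proportion of a participant's exposures to one source flagged as AI."""
    return sum(flags) / len(flags)


def mean_and_sem(values):
    """Mean and SEM across participants (assumed: sample SD / sqrt(n))."""
    return mean(values), stdev(values) / sqrt(len(values))


def summarize(participants):
    """Average each participant's PS per source and attach the SEM."""
    per_participant = [probability_of_superiority(p) for p in participants]
    return {s: mean_and_sem([ps[s] for ps in per_participant]) for s in SOURCES}


# Invented toy rankings for two participants, two scenarios each.
toy_participants = [
    [
        {"GPT-4o": 1, "GPT-5": 2, "Reddit-Top": 3, "Reddit-90th": 4},
        {"GPT-5": 1, "GPT-4o": 2, "Reddit-Top": 3, "Reddit-90th": 4},
    ],
    [
        {"Reddit-Top": 1, "GPT-4o": 2, "GPT-5": 3, "Reddit-90th": 4},
        {"GPT-4o": 1, "Reddit-Top": 2, "Reddit-90th": 3, "GPT-5": 4},
    ],
]

summary = summarize(toy_participants)
for source, (ps_mean, ps_sem) in summary.items():
    print(f"{source}: PS = {ps_mean:.3f} (SEM {ps_sem:.3f})")
```

Because each scenario yields a complete 1–4 ordering, every source takes part in three pairwise comparisons per scenario, so a participant's PS for a source reduces to the share of those comparisons it wins. Computing PS per participant first and averaging afterwards, as the caption describes, gives each participant equal weight.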