AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence

Minbeom Kim, Hwanhee Lee, Joonsuk Park, Hwaran Lee, Kyomin Jung

TL;DR

AdvisorQA introduces a benchmark for subjective, personal-advice QA that uses upvote-based rankings from the LifeProTips subreddit to capture collective preferences. It defines two orthogonal evaluation axes: helpfulness (via majority preferences and a Plackett-Luce ranking framework) and harmlessness (via LifeTox). The dataset contains 10,350 questions with rich, long-form responses and combines safe LifeProTips content with unsafe ULPT samples, so that safety can be studied under supervised fine-tuning and RLHF. Baseline models and training regimes show trade-offs between helpfulness and harmlessness. Experiments indicate that GPT-4 and human judgments align with AdvisorQA’s evaluation schema, while RLHF methods strike distinct balances between empathy, practicality, and safety, underscoring the need for nuanced controls in subjective AI advising.
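
The summary above names a Plackett-Luce ranking framework without giving its form. As a rough illustration only, the sketch below implements the standard Plackett-Luce log-likelihood of an observed ranking, plus a helper that places a new answer among human answers by score. The function names and the assumption that each answer already has a scalar helpfulness score are hypothetical, and the paper's exact formulation may differ.

```python
import math


def plackett_luce_log_likelihood(scores):
    """Log-probability of a ranking under the standard Plackett-Luce model.

    `scores` holds one scalar score per answer, ordered from the
    top-ranked answer to the bottom-ranked one (an assumption of this sketch).
    """
    total = 0.0
    for i in range(len(scores) - 1):
        tail = scores[i:]  # answers still unranked at step i
        m = max(tail)      # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        total += scores[i] - log_norm
    return total


def rank_against_humans(candidate_score, human_scores):
    """1-based rank of a candidate answer among human answers (hypothetical helper)."""
    return 1 + sum(1 for s in human_scores if s > candidate_score)
```

A scorer trained by maximizing this likelihood over upvote-ordered threads assigns higher scores to answers the majority preferred; a model-written answer can then be placed among the human answers of the same thread with `rank_against_humans`.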

Abstract

As the integration of large language models into daily life is on the rise, there is a clear gap in benchmarks for advising on subjective and personal dilemmas. To address this, we introduce AdvisorQA, the first benchmark developed to assess LLMs' capability in offering advice for deeply personalized concerns, utilizing the LifeProTips subreddit forum. This forum features a dynamic interaction where users post advice-seeking questions, receiving an average of 8.9 pieces of advice per query, with 164.2 upvotes from hundreds of users, embodying a collective intelligence framework. We have therefore built a benchmark encompassing daily life questions, diverse corresponding responses, and majority vote rankings used to train our helpfulness metric. Baseline experiments validate the efficacy of AdvisorQA through our helpfulness metric, GPT-4, and human evaluation, analyzing phenomena beyond the trade-off between helpfulness and harmlessness. AdvisorQA marks a significant leap in enhancing QA systems for providing personalized, empathetic advice, showcasing LLMs' improved understanding of human subjectivity.

Paper Structure (33 sections, 1 equation, 11 figures, 13 tables)

This paper contains 33 sections, 1 equation, 11 figures, 13 tables.

Figures (11)

  • Figure 1: An example test-set thread in AdvisorQA. It consists of an advice-seeking question and the advising answers sorted by upvote ranking. LLM advice is evaluated by the trained helpfulness metric based on its ranking against the human-written answers.
  • Figure 2: The distribution of average upvotes by rank of advice.
  • Figure 3: Visualization of the topic distribution of advice-seeking questions in AdvisorQA. A more detailed visualization is given in the expanded figure (label fig:Q_vis_expand).
  • Figure 4: Analysis of the primary values behind each evaluation metric. When GPT-4 and the PL model disagree on which advice is better, examining the cases where GPT-4 is right shows which values it prioritizes differently from the PL model, and vice versa. We surveyed these instances and sorted them into seven key values to gather insights on what each model values most in its decisions.
  • Figure 5: Experimental results for baseline models' performance in helpfulness and harmlessness.
  • ...and 6 more figures