Estimating Contribution Quality in Online Deliberations Using a Large Language Model

Lodewijk Gelauff, Mohak Goyal, Bhargav Dindukurthi, Ashish Goel, Alice Siu

TL;DR

This paper presents a scalable approach to estimating the quality of deliberative contributions using a Large Language Model (LLM), validated against human annotations of contributions from large-scale online deliberation events. The authors define four criteria (justification, novelty, expansion of the conversation, and potential for further expansion), each rated from 1 to 5, and prompt an LLM with the preceding context. The resulting scores are closer to the ground truth than those of individual human annotators and far more cost-efficient to obtain, although groups of three annotators still outperform the model on all four criteria. The study demonstrates the utility of automated quality estimates in evaluating platform interventions, notably nudges designed to increase participation: individual nudges after prolonged inactivity raise the likelihood of requesting to speak in the next 30 seconds by 65%, and nudged statements receive quality ratings similar to those made without nudging. The work highlights practical implications for real-time moderation and the design of large-scale deliberative systems, while acknowledging risks related to model bias and prompt sensitivity and outlining avenues for future refinement, including mean corrections and targeted interventions.

Abstract

Deliberation involves participants exchanging knowledge, arguments, and perspectives and has been shown to be effective at addressing polarization. The Stanford Online Deliberation Platform facilitates large-scale deliberations. It enables video-based online discussions on a structured agenda for small groups without requiring human moderators. This paper's data comes from various deliberation events, including one conducted in collaboration with Meta in 32 countries, and another with 38 post-secondary institutions in the US. Estimating the quality of contributions in a conversation is crucial for assessing feature and intervention impacts. Traditionally, this is done by human annotators, which is time-consuming and costly. We use a large language model (LLM) alongside eight human annotators to rate contributions based on justification, novelty, expansion of the conversation, and potential for further expansion, with scores ranging from 1 to 5. Annotators also provide brief justifications for their ratings. Using the average rating from other human annotators as the ground truth, we find the model outperforms individual human annotators. While pairs of human annotators outperform the model in rating justification and groups of three outperform it on all four metrics, the model remains competitive. We illustrate the usefulness of the automated quality rating by assessing the effect of nudges on the quality of deliberation. We first observe that individual nudges after prolonged inactivity are highly effective, increasing the likelihood of the individual requesting to speak in the next 30 seconds by 65%. Using our automated quality estimation, we show that the quality ratings for statements prompted by nudging are similar to those made without nudging, signifying that nudging leads to more ideas being generated in the conversation without losing overall quality.
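The comparison described in the abstract, where each annotator is scored against the average of the other annotators, can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `leave_one_out_errors`, the use of mean absolute error as the distance measure, and the example ratings are all assumptions made for this sketch.

```python
from statistics import mean


def leave_one_out_errors(human, model):
    """Score each annotator and the model against the mean of the other annotators.

    human: dict mapping annotator name -> list of 1-5 ratings, one per statement
    model: list of 1-5 ratings aligned with the same statements
    Returns a dict mapping annotator name -> (annotator_mae, model_mae).
    """
    n = len(model)
    result = {}
    for held_out, own in human.items():
        # Ground truth for this comparison excludes the held-out annotator.
        others = [r for name, r in human.items() if name != held_out]
        truth = [mean(r[i] for r in others) for i in range(n)]
        # Mean absolute error is an assumed metric, chosen for simplicity.
        human_mae = mean(abs(own[i] - truth[i]) for i in range(n))
        model_mae = mean(abs(model[i] - truth[i]) for i in range(n))
        result[held_out] = (human_mae, model_mae)
    return result


# Made-up ratings for three statements; these are not data from the paper.
human = {"a1": [4, 2, 5], "a2": [3, 2, 4], "a3": [5, 1, 4]}
model = [4, 2, 4]
errors = leave_one_out_errors(human, model)
```

In this toy input the model has a lower error than every held-out annotator, which mirrors the qualitative claim in the abstract but says nothing about the actual magnitudes reported in the paper.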

Paper Structure

This paper contains 32 sections, 1 equation, 27 figures, and 4 tables.

Figures (27)

  • Figure 1: Fraction of data points where groups of humans are outperformed by the model. The model 'wins' if it rates a statement closer to the golden rating than a group of humans.
  • Figure 2: Fraction of groups of humans outperformed by the model. The model 'wins' if it rates more statements closer to the golden rating than a group of humans.
  • Figure 3: Average scores received in the evaluation of rating-justification pairs by individual annotators on the 1-5 Likert scale. S1 and S2 are the contribution sets. The model's results are marked by stars. Each annotator has a unique color used across all criteria.
  • Figure 4: Average scores received in evaluations of rating-justification pairs on the 1-5 Likert scale.
  • Figure 5: Average room quality on Q1 (justification) and Q2 (novelty) for all rooms in events E1, E2, and E3. The centroid corresponds to the average of the room quality for each room in the event.
  • ...and 22 more figures
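
The two "win" fractions described in the captions of Figures 1 and 2 can be sketched as below. This is a hedged illustration under several assumptions that the captions do not confirm: the golden rating is taken to be the mean rating of the annotators outside the group, a group's rating is the mean of its members, ties count as neither a win nor a loss, and the name `win_fractions` and the example ratings are hypothetical.

```python
from itertools import combinations
from statistics import mean


def win_fractions(human, model, k):
    """Return (per-statement win fraction, per-group win fraction) for groups of size k.

    human: dict mapping annotator name -> list of ratings, one per statement
    model: list of ratings aligned with the same statements
    """
    names = sorted(human)
    if not 0 < k < len(names):
        raise ValueError("k must leave at least one annotator for the golden rating")
    n = len(model)
    point_wins = point_total = group_wins = group_total = 0
    for group in combinations(names, k):
        rest = [a for a in names if a not in group]
        wins = losses = 0
        for i in range(n):
            # Assumed golden rating: mean of annotators outside the group.
            golden = mean(human[a][i] for a in rest)
            group_err = abs(mean(human[a][i] for a in group) - golden)
            model_err = abs(model[i] - golden)
            if model_err < group_err:
                wins += 1
            elif model_err > group_err:
                losses += 1
        point_wins += wins          # Figure 1 style: count statement-level wins
        point_total += n
        group_wins += wins > losses  # Figure 2 style: model beats the group overall
        group_total += 1
    return point_wins / point_total, group_wins / group_total


# Made-up ratings for three statements; these are not data from the paper.
human = {"a1": [4, 2, 5], "a2": [3, 2, 4], "a3": [5, 1, 4]}
model = [4, 2, 4]
point_fraction, group_fraction = win_fractions(human, model, k=1)
```

Counting ties as non-wins makes the per-statement fraction conservative for the model; a different tie rule would change the numbers without changing the structure of the comparison.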