Eliciting Uncertainty in Chain-of-Thought to Mitigate Bias against Forecasting Harmful User Behaviors

Anthony Sicilia, Malihe Alikhani

TL;DR

This paper explores the extent to which model uncertainty can be used to mitigate potential biases, using 5 open-source language models tested on 2 datasets designed to evaluate conversation forecasting for social media moderation.

Abstract

Conversation forecasting tasks a model with predicting the outcome of an unfolding conversation. For instance, it can be applied in social media moderation to predict harmful user behaviors before they occur, allowing for preventative interventions. While large language models (LLMs) have recently been proposed as an effective tool for conversation forecasting, it's unclear what biases they may have, especially against forecasting the (potentially harmful) outcomes we request them to predict during moderation. This paper explores to what extent model uncertainty can be used as a tool to mitigate potential biases. Specifically, we ask three primary research questions: 1) how does LLM forecasting accuracy change when we ask models to represent their uncertainty; 2) how does LLM bias change when we ask models to represent their uncertainty; 3) how can we use uncertainty representations to reduce or completely mitigate biases without many training data points. We address these questions for 5 open-source language models tested on 2 datasets designed to evaluate conversation forecasting for social media moderation.

Paper Structure

This paper contains 39 sections, 1 equation, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Two difficult social media moderation examples. Both instances appear as if they may derail, leading to harmful user behaviors. Yet, only one does. These are real examples from the moderation corpora we study, identified using https://awry.infosci.cornell.edu
  • Figure 2: F1 vs. bias for all models and datasets with different inference strategies. CoT refers to our standard conversation forecasting prompt (i.e., one that uses CoT), while uncertain CoT asks the model to represent its uncertainty in place of direct classification. Scaling refers to post-hoc scaling and is only applicable to the former strategy. It is best to have near-zero bias and a high F1 score.
  • Figure 3: Statistical bias of forecasts on Reddit for Mistral models and Qwen2. Language models either use uncertainty estimates to report inferences (uncertain CoT) or make traditional binary decisions (CoT). The impact of post-hoc scaling is also shown for the former of these methods. Topics are determined using the method from § \ref{sec:meth}.