Identifying Hate Speech Peddlers in Online Platforms. A Bayesian Social Learning Approach for Large Language Model Driven Decision-Makers

Adit Jain; Vikram Krishnamurthy

TL;DR

The paper studies autonomous, language-driven Bayesian agents that sequentially detect a hidden state from high-dimensional text observations, using LLMs as sensors. It proves that such agents form information cascades and herd in finite time, and it develops a stopping-time stochastic-control framework that delays herding by selectively revealing private observations; the optimal policy is shown to have a threshold structure. The framework is applied to flagging hate-speech peddlers on online platforms, with numerical experiments that exhibit information cascades and show how the policy threshold delays herding. The work highlights the privacy-utility tradeoff in sequential language-driven decision-making agents (LDDAs) and offers practical insights for deploying LLM-based detectors in sensitive online moderation tasks.

Abstract

This paper studies the problem of autonomous agents performing Bayesian social learning for sequential detection when the observations of the state belong to a high-dimensional space and are expensive to analyze. Specifically, when the observations are textual, the Bayesian agent can use a large language model (LLM) as a map to get a low-dimensional private observation. The agent performs Bayesian learning and takes an action that minimizes the expected cost and is visible to subsequent agents. We prove that a sequence of such Bayesian agents herd in finite time to the public belief and take the same action disregarding the private observations. We propose a stopping time formulation for quickest time herding in social learning and optimally balance privacy and herding. Structural results are shown on the threshold nature of the optimal policy to the stopping time problem. We illustrate the application of our framework when autonomous Bayesian detectors aim to sequentially identify if a user is a hate speech peddler on an online platform by parsing text observations using an LLM. We numerically validate our results on real-world hate speech datasets. We show that autonomous Bayesian agents designed to flag hate speech peddlers in online platforms herd and misclassify the users when the public prior is strong. We also numerically show the effect of a threshold policy in delaying herding.

Paper Structure (22 sections, 2 theorems, 19 equations, 4 figures, 1 algorithm)

Introduction
Main Results
Motivation
Organization
Social Learning for language-driven decision-making agents (LDDAs)
Main Result: Herding in Social Learning
Stochastic Control Approach for Delayed Herding in LDDAs
Socialistic Objective for Stopping Time Problem
Structural Results
Key Application and Numerical Results: Flagging Hate Speech Peddlers on Online Platforms
LDDAs detecting hate speech peddlers on online platforms: Motivation and Experimental Setup
Numerical Demonstration of Herding
Threshold policy for delayed herding in stopping time problem
Conclusion
Proof of Theorem 1
...and 7 more sections

Key Result

Theorem 1

(Herding in Bayesian social learning of LDDAs) The social learning protocol of the LDDAs described in Algorithm 1 leads to an information cascade, and the agents herd, in finite time $K<\infty$ with probability 1. Both notions are used as defined in the paper.
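
The cascade mechanism behind Theorem 1 can be illustrated with a minimal sketch. It is not the paper's Algorithm 1: it assumes a binary state, a binary LLM output, a fixed likelihood matrix `B` in place of the neural-network likelihood, and a 0-1 misclassification cost `COST`. All names and numbers here are hypothetical. Once the public belief is strong enough that both possible observations lead to the same action, the action carries no information and the public belief stops changing.

```python
import random

# Hypothetical sensor model: B[x][y] = P(LLM output y | state x).
B = [[0.7, 0.3], [0.3, 0.7]]
# Hypothetical cost: COST[u][x] = cost of action u when the state is x (0-1 loss).
COST = [[0.0, 1.0], [1.0, 0.0]]

def private_belief(pi, y):
    """Bayes' rule: P(x=1 | y) given the public belief pi = P(x=1)."""
    num = B[1][y] * pi
    return num / (B[0][y] * (1.0 - pi) + num)

def action(pi, y):
    """Action minimizing the expected cost under the private belief."""
    eta = private_belief(pi, y)
    exp_cost = [COST[u][0] * (1.0 - eta) + COST[u][1] * eta for u in (0, 1)]
    return 0 if exp_cost[0] <= exp_cost[1] else 1

def in_cascade(pi):
    """True when the action no longer depends on the private observation."""
    return action(pi, 0) == action(pi, 1)

def public_update(pi, a):
    """Update the public belief from an observed action a."""
    # lik[x] = P(action = a | state x, public belief pi)
    lik = [sum(B[x][y] for y in (0, 1) if action(pi, y) == a) for x in (0, 1)]
    num = lik[1] * pi
    return num / (lik[0] * (1.0 - pi) + num)

def simulate(x, pi0, n_agents, seed=0):
    """Run n_agents sequential agents; return their actions and the final public belief."""
    rng = random.Random(seed)
    pi, actions = pi0, []
    for _ in range(n_agents):
        y = 1 if rng.random() < B[x][1] else 0  # private observation
        a = action(pi, y)
        actions.append(a)
        pi = public_update(pi, a)
    return actions, pi
```

For example, `simulate(1, 0.1, 20)` starts from a strong public prior against state 1: under these assumed numbers every agent takes action 0 even though the true state is 1, and the public belief stays frozen at its initial value.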

Figures (4)

Figure 1: The language-driven decision-making agent (LDDA) modeled in the text has two components: (a) a large language model (LLM), which acts as a noisy sensor that parses the high-dimensional text observation $y^\prime_k$ into a low-dimensional observation $y_k$ of the true state $x_k$, and (b) a neural Bayesian engine, which applies Bayes' rule to update the belief about the state, with the likelihood parameterized by a neural network. The LDDA outputs the action that minimizes the expected cost and uses the actions of the previous LDDAs to update the prior.
Figure 2: Schematic representation of social learning with language-driven decision-making agents (LDDAs), which use LLM sensors to compress the observations ($y^\prime_m$) and estimate the underlying states ($x_m$). Each LDDA updates its prior over the state space using past actions ($u_m$) and a likelihood, parameterized by a neural network, for the compressed observations.
Figure 3: Herding in LDDAs. Across the different underlying states, the average action of the agents is $0$ when the initial prior probability is strong. Hence, if the public prior is strong, the LDDAs misclassify the user as a non-hate-speech peddler even in the presence of private evidence to the contrary.
Figure 4: Effect of different thresholds in the paper's threshold policy. For the state $x=1$, increasing the policy threshold delays herding in regions of strong priors (left and right). The optimal threshold, which can be found by stochastic approximation, delays herding so as to balance the delay, error, and social learning costs in the social welfare cost.
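
The qualitative effect in Figure 4 can be reproduced with a toy threshold rule. The sketch below is an assumption-laden illustration, not the paper's policy: it assumes a binary state, a hypothetical likelihood matrix `B`, and a one-sided rule under which agents reveal their private observations (so the public belief is updated directly by Bayes' rule) until the public belief `pi = P(x=1)` reaches `threshold`, after which they herd. The function `herding_time` and its parameters are hypothetical names.

```python
import random

# Hypothetical sensor model: B[x][y] = P(observation y | state x).
B = [[0.7, 0.3], [0.3, 0.7]]

def bayes_update(pi, y):
    """Posterior P(x=1 | y) given the current public belief pi = P(x=1)."""
    num = B[1][y] * pi
    return num / (B[0][y] * (1.0 - pi) + num)

def herding_time(x, pi0, threshold, n_agents=200, seed=0):
    """Index of the first agent that herds, or n_agents if none does.

    Assumed rule: while pi < threshold, the agent's private observation is
    revealed and updates the public belief; once pi >= threshold, agents herd.
    """
    rng = random.Random(seed)
    pi = pi0
    for k in range(n_agents):
        if pi >= threshold:
            return k
        y = 1 if rng.random() < B[x][1] else 0  # revealed private observation
        pi = bayes_update(pi, y)
    return n_agents
```

With a fixed seed the belief trajectory is identical across thresholds, so a larger threshold can only postpone the herding time; comparing `herding_time(1, 0.5, t)` for a few values of `t` shows the delay growing with `t`. Choosing the threshold that best trades off delay, error, and social learning costs is the optimization the paper addresses and is not attempted here.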

Theorems & Definitions (12)

Definition 1
Definition 2
Theorem 1
proof
Theorem 2
proof
proof
proof
Definition 3
Definition 4
...and 2 more
