Supporting Human Raters with the Detection of Harmful Content using Large Language Models

Kurt Thomas, Patrick Gage Kelley, David Tao, Sarah Meiklejohn, Owen Vallis, Shunwen Tan, Blaž Bratanič, Felipe Tiengo Ferreira, Vijay Kumar Eranti, Elie Bursztein

TL;DR

This work investigates using large language models (LLMs) to assist human moderators in identifying harmful online content across hate speech, harassment, violent extremism, and election misinformation. It introduces five collaborative design patterns and a single, flexible prompting approach that supports pre-filtering, rapid escalation, autonomous rating, and human-assisted decision making. On a 50,000-comment dataset, the best prompts reach 90% accuracy measured against human verdicts. In a pilot deployment on a real-world review queue, the techniques yielded a 41.5% improvement in optimizing available human rater capacity and a 9–11% absolute increase in precision and recall for detecting violative content. The findings point to a practical path toward scaling moderation, improving consistency, and reducing the emotional burden on human raters, while acknowledging risks such as prompt manipulation, model drift, and bias that require ongoing safeguards and evaluation.

Abstract

In this paper, we explore the feasibility of leveraging large language models (LLMs) to automate or otherwise assist human raters with identifying harmful content including hate speech, harassment, violent extremism, and election misinformation. Using a dataset of 50,000 comments, we demonstrate that LLMs can achieve 90% accuracy when compared to human verdicts. We explore how to best leverage these capabilities, proposing five design patterns that integrate LLMs with human rating, such as pre-filtering non-violative content, detecting potential errors in human rating, or surfacing critical context to support human rating. We outline how to support all of these design patterns using a single, optimized prompt. Beyond these synthetic experiments, we share how piloting our proposed techniques in a real-world review queue yielded a 41.5% improvement in optimizing available human rater capacity, and a 9–11% increase (absolute) in precision and recall for detecting violative content.
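
Two of the design patterns named in the abstract, pre-filtering non-violative content and detecting potential errors in human rating, can be illustrated with a minimal sketch. It assumes the LLM's output has already been reduced to a violation score in [0, 1]; that score, the `Thresholds` values, the `margin` parameter, and all function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_ALLOW = "auto_allow"      # pre-filtered as non-violative, no human review
    HUMAN_REVIEW = "human_review"  # uncertain, sent to a human rater
    ESCALATE = "escalate"          # likely violative, prioritized for review


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical cut-offs; a real deployment would tune them on labeled data.
    allow_below: float = 0.05
    escalate_above: float = 0.95


def route(score: float, t: Thresholds = Thresholds()) -> Route:
    """Map an assumed LLM violation score in [0, 1] to a review route."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if score < t.allow_below:
        return Route.AUTO_ALLOW
    if score > t.escalate_above:
        return Route.ESCALATE
    return Route.HUMAN_REVIEW


def flag_disagreement(score: float, human_says_violative: bool,
                      margin: float = 0.9) -> bool:
    """Flag a human verdict for a second look when the LLM strongly disagrees."""
    llm_says_violative = score >= 0.5
    # Rescale distance from 0.5 to [0, 1] so `margin` reads as a confidence level.
    confident = abs(score - 0.5) * 2 >= margin
    return confident and llm_says_violative != human_says_violative
```

Under these assumptions, `route(0.01)` skips human review, `route(0.5)` goes to a rater, and `flag_disagreement(0.99, False)` marks a human "non-violative" verdict for re-checking. Keeping the thresholds in one small object makes the trade-off between rater capacity and missed violations explicit and easy to tune.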

Paper Structure (70 sections, 11 figures, 10 tables)

Figures (11)

  • Figure 1: Platforms rely on human raters to label data for supervised abuse classifiers, to triage user reports of suspected policy violations, and to determine the validity of user appeals that claim content is non-violative. We envision an LLM agent that optimizes which content gets sent to human raters and assists human raters in arriving at decisions.
  • Figure 2: Design patterns for using an LLM to assist human raters. We separate these into designs that attempt to optimize which content is sent to human raters (❶, ❷, ❸) and designs that attempt to improve the accuracy of human raters (❹, ❺).
  • Figure 3: Initial prompt derived from Google's public policy around census and election misinformation. Across all our experiments, this policy language remains static.
  • Figure 4: A few-shot prompt variant that includes both an example policy-relevant comment and answer, and keyword context. The comment under evaluation appears after all examples.
  • Figure 5: Accuracy of text-unicorn for our hand-picked few-shot prompt variant. We segment our evaluation corpus into buckets of 0–9 characters, 10–19 characters, and so on up to 100+ characters. For each sample, we display error margins for a confidence level of 95%. (Note the truncated Y-axis.)
  • ...and 6 more figures
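
The length-bucketed evaluation described in the Figure 5 caption can be sketched as follows. The caption does not say how the 95% error margins were computed, so this sketch assumes the normal-approximation (Wald) interval for a proportion; the function names and the `(comment_text, is_correct)` input format are hypothetical.

```python
import math
from collections import defaultdict

BUCKET_WIDTH = 10        # 0-9, 10-19, ... characters
LAST_BUCKET_START = 100  # everything at or above this length falls into "100+"
Z_95 = 1.96              # z-score for a two-sided 95% confidence level


def bucket_label(text: str) -> str:
    """Return the character-length bucket for a comment."""
    start = min(len(text) // BUCKET_WIDTH * BUCKET_WIDTH, LAST_BUCKET_START)
    if start == LAST_BUCKET_START:
        return "100+"
    return f"{start}-{start + BUCKET_WIDTH - 1}"


def accuracy_by_length(samples):
    """Compute (accuracy, error margin, sample count) per length bucket.

    `samples` is an iterable of (comment_text, is_correct) pairs, where
    is_correct says whether the model verdict matched the human verdict.
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for text, is_correct in samples:
        entry = counts[bucket_label(text)]
        entry[0] += int(is_correct)
        entry[1] += 1

    result = {}
    for label, (correct, total) in counts.items():
        p = correct / total
        # Assumed Wald interval; it is unreliable for tiny buckets or p near 0 or 1.
        margin = Z_95 * math.sqrt(p * (1 - p) / total)
        result[label] = (p, margin, total)
    return result
```

Because the margin shrinks with the square root of the bucket size, sparsely populated buckets show visibly wider error bars, which is worth keeping in mind when reading a plot with a truncated Y-axis.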