Benchmarking LLMs in Political Content Text-Annotation: Proof-of-Concept with Toxicity and Incivility Data

Bastián González-Bustamante

TL;DR

The paper evaluates how GPTs and open-source LLMs perform zero-shot toxicity and incivility annotation on political content, using a gold standard derived from human annotations on a large protest dataset. It compares Perspective API, OpenAI GPTs, and multiple open-source LLMs under a uniform prompt to classify messages as TOXIC or NONTOXIC. Key findings show Perspective with a lax threshold often matches or exceeds other models, while GPT-4o and Nous Hermes 2 Mixtral achieve the best F1 scores; several small open-source models offer competitive performance with faster local deployment. The work highlights reproducibility benefits of open-source models and argues for trade-offs among accuracy, computing time, and deployment location, with implications for scalable, privacy-preserving annotation in political discourse research.

Abstract

This article benchmarked the ability of OpenAI's GPTs and a number of open-source LLMs to perform annotation tasks on political content. We used a novel protest event dataset comprising more than three million digital interactions and created a gold standard that includes ground-truth labels annotated by human coders for toxicity and incivility on social media. We included Google's Perspective algorithm in our benchmark; Perspective and the GPTs were accessed through their respective APIs, while the open-source LLMs were deployed locally. The findings show that Perspective API using a laxer threshold, GPT-4o, and Nous Hermes 2 Mixtral outperform the other LLMs' zero-shot classification annotations. In addition, Nous Hermes 2 and Mistral OpenOrca, which have fewer parameters, perform the task well, making them attractive options that could offer good trade-offs between performance, implementation costs, and computing time. Ancillary experiments with different temperature settings show that although GPTs tend to offer excellent computing time and good overall reliability, only open-source LLMs ensure full reproducibility in the annotation.
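
To make the evaluation setup concrete, the following minimal sketch shows how a continuous toxicity score can be turned into TOXIC or NONTOXIC labels under a stricter and a laxer threshold, and how each set of labels can be scored against human gold-standard labels with F1. It is an illustration only: the function names, the threshold values (0.8 and 0.5), and the six toy messages are assumptions made for this example and are not taken from the article or its data.

```python
from typing import Sequence

TOXIC, NONTOXIC = "TOXIC", "NONTOXIC"


def scores_to_labels(scores: Sequence[float], threshold: float) -> list[str]:
    """Map continuous toxicity scores to binary labels (hypothetical helper)."""
    return [TOXIC if s >= threshold else NONTOXIC for s in scores]


def f1_score(gold: Sequence[str], predicted: Sequence[str], positive: str = TOXIC) -> float:
    """F1 for the positive class, computed from true/false positives and false negatives."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (invented for illustration): human labels and model scores for six messages.
gold = [TOXIC, NONTOXIC, TOXIC, TOXIC, NONTOXIC, NONTOXIC]
scores = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1]

# Assumed threshold values; the article's actual thresholds are not given in this section.
strict_f1 = f1_score(gold, scores_to_labels(scores, threshold=0.8))
lax_f1 = f1_score(gold, scores_to_labels(scores, threshold=0.5))

print(f"strict threshold F1: {strict_f1:.3f}")
print(f"lax threshold F1:    {lax_f1:.3f}")
```

On this toy data the laxer threshold recovers more of the messages that human coders marked as toxic, at the cost of one false positive. That is the kind of trade-off a threshold comparison is meant to expose. For an LLM that returns TOXIC or NONTOXIC directly, the thresholding step is skipped and its labels are passed to the same scoring function.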
