Benchmarking LLMs in Political Content Text-Annotation: Proof-of-Concept with Toxicity and Incivility Data

Bastián González-Bustamante

TL;DR

The paper evaluates how GPTs and open-source LLMs perform zero-shot toxicity and incivility annotation on political content, using a gold standard derived from human annotations on a large protest dataset. It compares Perspective API, OpenAI GPTs, and multiple open-source LLMs under a uniform prompt to classify messages as TOXIC or NONTOXIC. Key findings show Perspective with a lax threshold often matches or exceeds other models, while GPT-4o and Nous Hermes 2 Mixtral achieve the best F1 scores; several small open-source models offer competitive performance with faster local deployment. The work highlights reproducibility benefits of open-source models and argues for trade-offs among accuracy, computing time, and deployment location, with implications for scalable, privacy-preserving annotation in political discourse research.

Abstract

This article benchmarked the ability of OpenAI's GPTs and a number of open-source LLMs to perform annotation tasks on political content. We used a novel protest event dataset comprising more than three million digital interactions and created a gold standard that includes ground-truth labels annotated by human coders about toxicity and incivility on social media. We included in our benchmark Google's Perspective algorithm, which, along with GPTs, was employed through their respective APIs, while the open-source LLMs were deployed locally. The findings show that Perspective API using a laxer threshold, GPT-4o, and Nous Hermes 2 Mixtral outperform other LLMs' zero-shot classification annotations. In addition, Nous Hermes 2 and Mistral OpenOrca, with a smaller number of parameters, perform the task well, making them attractive options that could offer good trade-offs between performance, implementation costs, and computing time. Ancillary findings from experiments with different temperature settings show that, although GPTs tend to show not only excellent computing time but also overall good levels of reliability, only open-source LLMs ensure full reproducibility in the annotation.
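
To make the role of the threshold concrete, the following is a minimal sketch of how a continuous toxicity score can be binarized at a stricter and a laxer cutoff and then scored against gold labels with F1. The gold labels, the scores, the cutoffs 0.8 and 0.5, and the helper names `binarize` and `f1_score` are all illustrative assumptions; none of them are taken from the paper.

```python
def binarize(scores, threshold):
    """Map continuous scores to 1 (toxic) or 0 (non-toxic) at a cutoff."""
    return [1 if s >= threshold else 0 for s in scores]


def f1_score(gold, pred):
    """F1 for the positive (toxic) class; returns 0.0 with no true positives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and toxicity scores for six messages.
gold = [1, 1, 1, 0, 0, 0]
scores = [0.92, 0.63, 0.55, 0.40, 0.58, 0.10]

strict = binarize(scores, 0.8)  # assumed stricter cutoff
lax = binarize(scores, 0.5)     # assumed laxer cutoff

print("strict F1:", round(f1_score(gold, strict), 3))
print("lax F1:", round(f1_score(gold, lax), 3))
```

In this toy data the laxer cutoff recovers more of the toxic messages at the cost of one false positive, which raises F1. Whether the same holds on real data depends on the score distribution.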

Paper Structure (12 sections, 3 figures, 3 tables)

This paper contains 12 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Jaccard Distance Heatmap between Gold Standard, Perspective API and Zero-Shot LLMs Classifiers
  • Figure 2: Average Performance, Number of Parameters and Computing Time of Zero-Shot LLMs Classifiers for Toxicity
  • Figure 3: Output Reliability Experiments of Zero-Shot LLMs Classifiers for Toxicity with Best Performance Models
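
As a rough illustration of the measure named in Figure 1, the sketch below computes a Jaccard distance between two annotators' label vectors by comparing the sets of messages each one marks as TOXIC. This set-based formulation, the function name `jaccard_distance`, and the example labels are assumptions made for illustration; the paper may compute the distance differently.

```python
def jaccard_distance(labels_a, labels_b, positive="TOXIC"):
    """1 - |A ∩ B| / |A ∪ B| over the indices each annotator marks positive."""
    set_a = {i for i, y in enumerate(labels_a) if y == positive}
    set_b = {i for i, y in enumerate(labels_b) if y == positive}
    union = set_a | set_b
    if not union:
        # Neither annotator flagged anything: treat as identical.
        return 0.0
    return 1 - len(set_a & set_b) / len(union)


# Hypothetical labels for five messages.
gold = ["TOXIC", "TOXIC", "NONTOXIC", "NONTOXIC", "TOXIC"]
model = ["TOXIC", "NONTOXIC", "NONTOXIC", "TOXIC", "TOXIC"]

print(jaccard_distance(gold, model))
```

A distance of 0 means the two sources flag exactly the same messages, and a distance of 1 means they share none.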