Assessing the Human Likeness of AI-Generated Counterspeech
Xiaoying Song, Sujana Mamidisetty, Eduardo Blanco, Lingzi Hong
TL;DR
The paper investigates whether AI-generated counterspeech resembles human responses and how this human likeness influences effectiveness. It implements four LLM-based generation strategies and evaluates them against human-written counterspeech from Reddit and crowd workers, using authorship attribution and human judgments. The study finds that AI-generated counterspeech is easily distinguishable from human-written counterspeech, though fine-tuning on relevant datasets increases human likeness. It also reveals systematic differences in linguistic features, politeness, and specificity. These findings inform the design of safer, more effective counterspeech systems, and the dataset is publicly available for further research.
Abstract
Counterspeech is a targeted response to counteract and challenge abusive or hateful content. It effectively curbs the spread of hatred and fosters constructive online communication. Previous studies have proposed different strategies for automatically generating counterspeech. Evaluations, however, focus on relevance, surface form, and other shallow linguistic characteristics. This paper investigates the human likeness of AI-generated counterspeech, a critical factor influencing effectiveness. We implement and evaluate several LLM-based generation strategies, and discover that AI-generated and human-written counterspeech can be easily distinguished by both simple classifiers and humans. Further, we reveal differences in linguistic characteristics, politeness, and specificity. The dataset used in this study is publicly available for further research.
