A Few Hypocrites: Few-Shot Learning and Subtype Definitions for Detecting Hypocrisy Accusations in Online Climate Change Debates
Paulina Garcia Corral, Avishai Green, Hendrik Meyer, Anke Stoll, Xiaoyue Yan, Myrthe Reuver
TL;DR
This work treats hypocrisy accusation detection as its own NLP task within online climate discourse and introduces the Climate Hypocrisy Accusation Corpus (CHAC), a 420-comment dataset annotated by experts into personal moral hypocrisy and political hypocrisy. Through six-shot in-context learning across GPT-4o, GPT-3.5, and Llama-3, the study shows that newer instruction-tuned models achieve meaningful detection performance (macro-F1 ≈ 0.67–0.68), with personal hypocrisy easier to identify than political hypocrisy. The paper provides a careful error analysis, revealing systematic challenges such as false positives driven by mentions of 'hypocrisy', false negatives for older models, and subtype misclassification, especially for political content. By releasing CHAC and detailing an annotation scheme and experimental protocol, the work enables scalable, domain-specific analysis of hypocrisy in climate debates and highlights directions for future improvement and broader applicability in social science text analysis.
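For readers unfamiliar with the metric reported above, macro-F1 is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a frequent one. The following is a minimal sketch of that computation. The label names (`personal`, `political`, `none`) and the toy predictions are illustrative assumptions, not the paper's label set or its data.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # missed an instance of the gold class g
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration only (not CHAC annotations).
gold = ["personal", "political", "none", "personal"]
pred = ["personal", "none", "none", "political"]
score = macro_f1(gold, pred)
```

Because each class contributes equally to the average, weak performance on one subtype pulls the overall score down even when the other subtype is detected well.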
Abstract
The climate crisis is a salient issue in online discussions, and hypocrisy accusations are a central rhetorical element in these debates. However, hypocrisy accusation detection remains understudied as a tool for large-scale text analysis, and is most often treated as a minor subtask of fallacious argument detection. In this paper, we define hypocrisy accusation detection as an independent NLP task and identify relevant subtypes of hypocrisy accusations. Our Climate Hypocrisy Accusation Corpus (CHAC) consists of 420 Reddit climate debate comments, expert-annotated for two types of hypocrisy accusations: personal and political hypocrisy. We evaluate six-shot in-context learning with three instruction-tuned Large Language Models (LLMs) for detecting hypocrisy accusations in this dataset. Results indicate that GPT-4o and Llama-3 in particular show promise in detecting hypocrisy accusations (F1 reaching 0.68, compared with 0.44 in previous work). However, context matters for a complex semantic concept such as hypocrisy accusations, and we find that models struggle more with political hypocrisy accusations than with personal moral hypocrisy accusations. Our study contributes new insights into hypocrisy detection and climate change discourse, and is a stepping stone toward large-scale analysis of hypocrisy accusations in online climate debates.
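To make the six-shot in-context learning setup concrete, the sketch below assembles a prompt from six labelled demonstrations followed by the comment to classify. It is an illustration under stated assumptions, not the authors' prompt: the instruction wording, the label names, the `build_few_shot_prompt` helper, and the placeholder demonstration texts are all hypothetical and are not drawn from CHAC.

```python
# Hypothetical label set; the paper's exact label names are not given here.
LABELS = ("personal", "political", "none")

# Hypothetical instruction text, written for this sketch only.
INSTRUCTION = (
    "Classify the comment as a personal hypocrisy accusation, "
    "a political hypocrisy accusation, or none. "
    f"Answer with one of: {', '.join(LABELS)}."
)


def build_few_shot_prompt(shots, comment, instruction=INSTRUCTION, n_shots=6):
    """Return a prompt with n_shots labelled demonstrations and one open query."""
    if len(shots) != n_shots:
        raise ValueError(f"expected {n_shots} demonstrations, got {len(shots)}")
    parts = [instruction]
    for text, label in shots:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        parts.append(f"Comment: {text}\nLabel: {label}")
    # The final block leaves the label empty for the model to complete.
    parts.append(f"Comment: {comment}\nLabel:")
    return "\n\n".join(parts)


# Placeholder demonstrations, invented for illustration (two per label).
shots = [
    ("<example personal accusation 1>", "personal"),
    ("<example personal accusation 2>", "personal"),
    ("<example political accusation 1>", "political"),
    ("<example political accusation 2>", "political"),
    ("<example non-accusation 1>", "none"),
    ("<example non-accusation 2>", "none"),
]
prompt = build_few_shot_prompt(shots, "<comment to classify>")
```

The resulting string would be sent to each evaluated model, and the generated label compared against the expert annotation. Balancing the demonstrations across labels, as done here, is a design choice of this sketch rather than something the abstract specifies.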
