Populism Meets AI: Advancing Populism Research with LLMs

Yujin J. Jung, Eduardo Ryô Tamaki, Julia Chatterley, Grant Mitchell, Semir Dzebo, Cristóbal Sandoval, Levente Littvay, Kirk A. Hawkins

TL;DR

This paper investigates whether domain-specific prompting of large language models can replicate Holistic Grading (HG) of populism, enabling scalable, cross-language measurement of populist discourse. By adapting HG materials from the Global Populism Database and applying Chain-of-Thought prompting, the authors test a diverse set of models on 12 speeches from the UK, Turkey, and Montenegro, using a test-retest design and multiple agreement metrics. The strongest reasoning-enabled models (notably GPT-5 with high reasoning and Qwen3 235B with reasoning) achieve near-human agreement (e.g., $r\approx0.97$, $CCC\approx0.95$, $ICC\approx0.95$) and high Krippendorff’s alpha, demonstrating the feasibility of automated holistic grading with domain fidelity. However, weaker open-weight models show large errors and poor calibration, revealing the critical role of reasoning capacity and model architecture. The work offers a scalable pathway for cross-national populism research and highlights caveats such as length sensitivity and potential reasoning priors, while outlining future directions including broader language coverage, retrieval grounding, and open sharing of prompts and benchmarks.
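To make two of the agreement metrics named above concrete, here is a minimal sketch of Pearson's $r$ and Lin's concordance correlation coefficient (CCC) in plain Python. The function names and the scores are hypothetical and chosen only for illustration; they are not data from the paper, and the authors' own computation may differ (for example, in the variance estimator used).

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation: strength of the linear association only."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def lin_ccc(x, y):
    """Lin's CCC: penalizes both scatter and systematic shift between raters.

    Uses population (divide-by-n) variances, one common convention.
    """
    n = len(x)
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


# Hypothetical scores: the "model" tracks the "human" coder perfectly
# in rank and spacing but is shifted upward by a constant 0.5.
human = [0.0, 0.5, 1.0, 1.5, 2.0]
model = [0.5, 1.0, 1.5, 2.0, 2.5]

r = pearson_r(human, model)
ccc = lin_ccc(human, model)
print(f"r = {r:.2f}, CCC = {ccc:.2f}")
```

In this toy case the correlation is perfect while the CCC is lower, because the CCC also penalizes the constant offset between the two sets of scores. That difference is one reason to report several agreement metrics rather than correlation alone.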

Abstract

Measuring the ideational content of populism remains a challenge. Traditional strategies based on textual analysis have been critical for building the field's foundations and providing a valid, objective indicator of populist framing. Yet these approaches are costly, time-consuming, and difficult to scale across languages, contexts, and large corpora. Here we present the results from a rubric- and anchor-guided chain-of-thought (CoT) prompting approach that mirrors human coder training. By leveraging the Global Populism Database (GPD), a comprehensive dataset of global leaders' speeches annotated for degrees of populism, we replicate the process used to train human coders by prompting the LLM with an adapted version of the same documentation to guide the model's reasoning. We then test multiple proprietary and open-weight models by replicating scores in the GPD. Our findings reveal that this domain-specific prompting strategy enables the LLM to achieve classification accuracy on par with expert human coders, demonstrating its ability to navigate the nuanced, context-sensitive aspects of populism.

Paper Structure

This paper contains 17 sections, 3 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: LLM and Human Holistic Grading in the United Kingdom
  • Figure 2: LLM and Human Holistic Grading in Türkiye
  • Figure 3: LLM and Human Holistic Grading in Montenegro
  • Figure S1: Bland-Altman plots (best runs): GPT-5 Reasoning (run 3), Qwen3 235B Reasoning (run 2), and Llama 4 Scout Standard (run 3).