Large Language Models' Detection of Political Orientation in Newspapers
Alessio Buscemi, Daniele Proverbio
TL;DR
The paper assesses whether four large language models (ChatGPT-4, ChatGPT-3.5, Gemini Pro, and Gemini Pro 1.5) consistently rate newspapers' political-economic orientation using a two-dimensional compass (Economy and Democracy) across 1000 articles from 40 newspapers in 27 countries. Employing a standardized, automated scraping and prompting pipeline, the study finds pronounced cross-LLM inconsistency: each model yields distinct distributions with substantial variability at the article level, and none reproduce the dataset's overall alignment. The work highlights risks of deploying LLMs for democracy-related tasks without human benchmarking and advocates for community-driven evaluation (e.g., NAV AI) and regulatory/educational measures to mitigate polarization. It concludes with a call for broader model benchmarking, transparency, and improved AI training to responsibly support journalism and public discourse.
Abstract
Democratic opinion-forming may be manipulated if newspapers' alignment to political or economic orientation is ambiguous. Various methods have been developed to better understand newspapers' positioning. Recently, the advent of Large Language Models (LLMs), and particularly of pre-trained LLM chatbots like ChatGPT or Gemini, holds disruptive potential to assist researchers and citizens alike. However, little is known about whether LLM assessment is trustworthy: does a single LLM agree with experts' assessment, and do different LLMs answer consistently with one another? In this paper, we specifically address the second challenge. We compare how four widely employed LLMs rate the positioning of newspapers, and test whether their answers align with one another. We observe that this is not the case. Over a worldwide dataset, articles in newspapers are positioned strikingly differently by individual LLMs, hinting at inconsistent training or excessive randomness in the algorithms. We thus raise a warning when deciding which tools to use, and we call for better training and algorithm development to close this significant gap in a matter highly sensitive for democracy and societies worldwide. We also call for community engagement in benchmark evaluation through our open initiative, navai.pro.
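As an illustration of what "answering consistently with one another" could mean in practice, the following minimal sketch computes the mean distance between two models' ratings of the same articles on a two-dimensional (Economy, Democracy) compass. It is not the pipeline used in the paper: the function name, the model labels, the score scale, and the toy scores are all hypothetical and chosen only for the example.

```python
from itertools import combinations
from math import dist


def pairwise_disagreement(ratings):
    """Mean Euclidean distance between models' (economy, democracy) scores.

    ratings: {model_name: [(economy, democracy), ...]}, where every model
    rates the same articles in the same order (an assumption of this sketch).
    Returns {(model_a, model_b): mean distance across articles}.
    """
    lengths = {len(scores) for scores in ratings.values()}
    if len(lengths) != 1:
        raise ValueError("every model must rate the same number of articles")

    result = {}
    for a, b in combinations(sorted(ratings), 2):
        distances = [dist(p, q) for p, q in zip(ratings[a], ratings[b])]
        result[(a, b)] = sum(distances) / len(distances)
    return result


# Toy scores, invented for illustration only (not data from the study).
toy_ratings = {
    "model_a": [(0.0, 0.0), (1.0, 1.0)],
    "model_b": [(3.0, 4.0), (1.0, 1.0)],
}
disagreement = pairwise_disagreement(toy_ratings)
print(disagreement)
```

A value near zero would indicate that two models place articles at nearly the same point on the compass, while larger values indicate diverging placements. Other measures, such as comparing the full distributions of scores, would be equally reasonable choices.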
