Large Language Models' Detection of Political Orientation in Newspapers

Alessio Buscemi, Daniele Proverbio

TL;DR

The paper assesses whether four large language models (ChatGPT-4, ChatGPT-3.5, Gemini Pro, and Gemini Pro 1.5) consistently rate newspapers' political-economic orientation using a two-dimensional compass (Economy and Democracy) across 1000 articles from 40 newspapers in 27 countries. Employing a standardized, automated scraping and prompting pipeline, the study finds pronounced cross-LLM inconsistency: each model yields distinct distributions with substantial variability at the article level, and none reproduce the dataset's overall alignment. The work highlights risks of deploying LLMs for democracy-related tasks without human benchmarking and advocates for community-driven evaluation (e.g., NAV AI) and regulatory/educational measures to mitigate polarization. It concludes with a call for broader model benchmarking, transparency, and improved AI training to responsibly support journalism and public discourse.

Abstract

Democratic opinion-forming may be manipulated if newspapers' political or economic orientation is ambiguous. Various methods have been developed to better understand newspapers' positioning. The recent advent of Large Language Models (LLMs), and particularly of pre-trained LLM chatbots like ChatGPT or Gemini, holds disruptive potential to assist researchers and citizens alike. However, little is known about whether LLM assessments are trustworthy: does a single LLM agree with experts' assessments, and do different LLMs answer consistently with one another? In this paper, we specifically address the second question. We compare how four widely employed LLMs rate the positioning of newspapers and test whether their answers align with one another. We observe that they do not. Over a worldwide dataset, articles in newspapers are positioned strikingly differently by individual LLMs, hinting at inconsistent training or excessive randomness in the algorithms. We thus raise a warning about the choice of tools, and we call for better training and algorithm development to close this significant gap in a matter that is highly sensitive for democracy and societies worldwide. We also call for community engagement in benchmark evaluation through our open initiative navai.pro.

Paper Structure

This paper contains 15 sections, 3 figures, 3 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Distribution of newspaper positioning. Each black dot represents a newspaper and is obtained by averaging the scores over all articles from that newspaper. The red dot is the average over all newspapers. Two example newspapers (De Morgen and Fox News) are tracked throughout the four scatter plots to exemplify how the LLMs map them in the compass.
  • Figure 2: 2D histograms (heatmaps) of the positioning scores over all articles. Each square represents a set of coordinates in the compass, and the color represents how many times each LLM used that coordinate to map articles. Note that the ChatGPT-4 scale is logarithmic (cf. Main Text).
  • Figure 3: Boxplot distribution of standard deviations within each newspaper, for each considered LLM. Each of the two panels refers to one dimension of evaluation: socioeconomic or democracy marks.
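
The aggregations behind the three figures can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the newspaper names, the integer score values, the tuple layout, and the helper names `summarize`, `overall_mean`, and `coordinate_counts` are hypothetical, and the use of the population standard deviation is a guess, since the captions do not say which estimator the authors used.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev

# Hypothetical ratings from one LLM: (newspaper, economy score, democracy score).
# Names and values are invented; they are not taken from the paper's dataset.
ratings = [
    ("Paper A", -2, 3),
    ("Paper A", 0, 1),
    ("Paper A", -4, 5),
    ("Paper B", 4, -1),
    ("Paper B", 4, -3),
]


def summarize(ratings):
    """Per-newspaper mean position (Figure 1 dots) and spread (Figure 3 input)."""
    by_paper = defaultdict(list)
    for paper, econ, dem in ratings:
        by_paper[paper].append((econ, dem))

    summary = {}
    for paper, scores in by_paper.items():
        econ = [s[0] for s in scores]
        dem = [s[1] for s in scores]
        summary[paper] = {
            "mean": (mean(econ), mean(dem)),
            # Population standard deviation is an assumption of this sketch.
            "std": (pstdev(econ), pstdev(dem)),
        }
    return summary


def overall_mean(summary):
    """Average of the per-newspaper means (the red dot in Figure 1)."""
    means = [entry["mean"] for entry in summary.values()]
    return (mean(m[0] for m in means), mean(m[1] for m in means))


def coordinate_counts(ratings):
    """How often each compass coordinate is used (the cells of Figure 2)."""
    return Counter((econ, dem) for _, econ, dem in ratings)


summary = summarize(ratings)
print(summary)
print(overall_mean(summary))
print(coordinate_counts(ratings))
```

Running the same three functions once per LLM and comparing the outputs would give one rough view of the cross-model inconsistency the paper reports.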