Accuracy and Political Bias of News Source Credibility Ratings by Large Language Models

Kai-Cheng Yang, Filippo Menczer

TL;DR

This study audits nine prominent LLMs from OpenAI, Google, and Meta to determine how well they rate news source credibility and what political biases arise in automated curation. Using zero-shot prompts with and without partisan roles, the authors compare LLM outputs to human expert ratings across thousands of sites, finding strong cross-LLM agreement ($\rho \approx 0.79$) but only moderate alignment with humans ($\rho \approx 0.50$). They uncover a liberal bias in default configurations and demonstrate that assigning partisan roles induces politically congruent biases, with aggregation across roles offering limited improvement. Larger models tend to refuse to rate obscure sources, while smaller models make more rating errors, underscoring data-void challenges. The work highlights substantial risks in relying on LLMs for news curation and points to paths for reducing bias and improving reliability in automated credibility assessments.

Abstract

Search engines increasingly leverage large language models (LLMs) to generate direct answers, and AI chatbots now access the Internet for fresh data. As information curators for billions of users, LLMs must assess the accuracy and reliability of different sources. This paper audits nine widely used LLMs from three leading providers -- OpenAI, Google, and Meta -- to evaluate their ability to discern credible and high-quality information sources from low-credibility ones. We find that while LLMs can rate most tested news outlets, larger models more frequently refuse to provide ratings due to insufficient information, whereas smaller models are more prone to making errors in their ratings. For sources where ratings are provided, LLMs exhibit a high level of agreement among themselves (average Spearman's $\rho = 0.79$), but their ratings align only moderately with human expert evaluations (average $\rho = 0.50$). Analyzing news sources with different political leanings in the US, we observe a liberal bias in credibility ratings yielded by all LLMs in default configurations. Additionally, assigning partisan roles to LLMs consistently induces strong politically congruent bias in their ratings. These findings have important implications for the use of LLMs in curating news and political information.
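
The agreement figures in the abstract are Spearman rank correlations computed only over sources for which ratings were provided. As a rough illustration of that kind of comparison, and not the authors' actual pipeline, the sketch below computes Spearman's $\rho$ between two raters after dropping sources that either rater declined to rate. The domain names, the 0-to-1 rating scale, the use of `None` for a refusal, and the function names are all assumptions made for this example.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_on_shared(ratings_a, ratings_b):
    """Spearman's rho over the sources that both raters actually rated.

    Each argument maps a source to a numeric rating, or to None when the
    rater declined to rate it (an assumed encoding of a refusal).
    Returns (rho, number_of_shared_sources).
    """
    shared = [
        s for s in ratings_a
        if ratings_a[s] is not None and ratings_b.get(s) is not None
    ]
    x = [ratings_a[s] for s in shared]
    y = [ratings_b[s] for s in shared]
    # Spearman's rho is the Pearson correlation of the rank-transformed data.
    return pearson(average_ranks(x), average_ranks(y)), len(shared)


# Toy, made-up ratings on an assumed 0-to-1 scale; not data from the paper.
llm_ratings = {
    "a.example": 0.9, "b.example": 0.7, "c.example": None,
    "d.example": 0.2, "e.example": 0.4,
}
expert_ratings = {
    "a.example": 0.8, "b.example": 0.9, "c.example": 0.5,
    "d.example": 0.1, "e.example": 0.3,
}

rho, n_shared = spearman_on_shared(llm_ratings, expert_ratings)
print(f"rho = {rho:.2f} over {n_shared} shared sources")
```

On this toy input the refusal on `c.example` is excluded, leaving four shared sources and $\rho = 0.8$. Restricting to shared sources means the correlation says nothing about the sources a model refuses to rate, so coverage has to be reported separately.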

Paper Structure (21 sections, 9 figures, 2 tables)

This paper contains 21 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Relationship between source popularity and the responses of LLMs. The left axes and the lines show the percentages of sources for which each LLM provides ratings. The dashed lines indicate the overall percentages, whereas the solid lines illustrate the results for different source ranking deciles. The right axes and the dots represent the Spearman correlation coefficients between LLM ratings and human expert ratings in different source ranking deciles. Sources in larger ranking deciles are less popular.
  • Figure 2: Percentage of errors among 200 manually annotated cases for each LLM.
  • Figure 3: Heatmap of source credibility rating correlation (Spearman's $\rho$) among different LLMs and human experts. Results in the upper right triangle of the heatmap are based on 3,077 (40.9%) sources rated by all LLMs. Results in the lower left triangle are based on the sources rated by both raters in comparison.
  • Figure 4: Heatmaps of Spearman correlation coefficients among the ratings generated by LLMs with different partisan roles. The highest correlation coefficients between the default LLM configuration and different partisan roles are highlighted by squares with solid edges. The highest correlation coefficients between human experts and different partisan roles are highlighted by squares with dashed edges.
  • Figure 5: Distributions of LLM rating bias scores of GPT-4o mini with different partisan roles. The blue and red violins represent the results for left- and right-leaning sources, respectively. Significance of t-tests is indicated by ***: $p<0.001$, *: $p<0.05$, NS: not significant.
  • ...and 4 more figures