Evaluation of Reliability Criteria for News Publishers with Large Language Models

Manuel Pratelli, John Bianchi, Fabio Pinelli, Marinella Petrocchi

TL;DR

This work investigates using a large language model to automate the evaluation of expert-defined reliability criteria for online news publishers in Italy. It defines six text-focused criteria drawn from GDI/NewsGuard, constructs a dataset of 340 articles from 34 publishers, and compares the LLM's outputs against human expert annotations (two experts per article), using Cohen's Kappa as the agreement metric. The LLM aligns substantially with the experts on three criteria, while bias and sensationalism detection still require prompt refinement. The LLM also helps resolve disagreements among human annotators, which suggests potential for scalable, near-real-time reliability assessment. The study points to practical uses in organizational workflows and reader-facing analyses, and outlines future work on additional criteria, other languages, and more sophisticated methods (e.g., RAG and few-shot learning).

Abstract

In this study, we investigate the use of a large language model to assist in evaluating the reliability of the vast number of existing online news publishers, a task for which relying solely on human expert annotators is impractical. In the context of the Italian news media market, we first task the model with evaluating expert-designed reliability criteria on a representative sample of news articles. We then compare the model's answers with those of human experts. The dataset consists of 340 news articles, each annotated by two human experts and the LLM. Six criteria are taken into account, for a total of 6,120 annotations. We observe good agreement between the LLM and the human annotators on three of the six criteria, including the critical ability to detect instances where a text negatively targets an entity or individual. For two additional criteria, namely the detection of sensational language and the recognition of bias in news content, the LLM generates fair annotations, albeit with certain trade-offs. Furthermore, we show that the LLM can help resolve disagreements among human experts, especially in tasks such as identifying cases of negative targeting.
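The TL;DR names Cohen's Kappa as the agreement metric. As a rough illustration of what that comparison involves, the sketch below computes kappa for two annotators from scratch and checks the annotation count stated in the abstract. The function name `cohen_kappa` and the sample labels are hypothetical and not taken from the paper; the 1 to 4 label values only echo the scale shown in the figure captions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for one criterion (illustrative only, not paper data).
expert = [1, 1, 4, 4, 1, 4, 4, 1]
llm = [1, 1, 4, 4, 4, 4, 4, 1]
kappa = cohen_kappa(expert, llm)  # 7/8 observed, 1/2 by chance -> 0.75

# Annotation count from the abstract:
# 340 articles x 6 criteria x 3 annotators (two experts plus the LLM).
total_annotations = 340 * 6 * 3  # 6120
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than simple accuracy when one label dominates the data.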

Paper Structure

This paper contains 18 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Traditional Approach to Evaluating the Reliability of a News Publisher
  • Figure 2: Approach
  • Figure 3: Average agreement between experts and LLM. Note: only cases where the experts agree with each other are considered
  • Figure 4: Confusion Matrix for Criterion: Sensational Language. 1: Sensational; 4: Neutral
  • Figure 5: Confusion Matrix for Criterion: Article Bias. 1: Biased; 4: Unbiased
  • ...and 1 more figure