The simulation of judgment in LLMs

Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Alessandro Santirocchi, Roberto Atzeni, Matteo Cinelli, Vincenzo Cestari, Clelia Rossi-Arnaud, Walter Quattrociocchi

TL;DR

This work compares six LLMs to expert ratings and human evaluations under an identical, structured framework and suggests that they may rely on lexical associations and statistical priors rather than contextual reasoning or normative criteria.

Abstract

Large Language Models (LLMs) are increasingly embedded in evaluative processes, from information filtering to assessing and addressing knowledge gaps through explanation and credibility judgments. This raises the need to examine how such evaluations are built, what assumptions they rely on, and how their strategies diverge from those of humans. We benchmark six LLMs against expert ratings--NewsGuard and Media Bias/Fact Check--and against human judgments collected through a controlled experiment. We use news domains purely as a controlled benchmark for evaluative tasks, focusing on the underlying mechanisms rather than on news classification per se. To enable direct comparison, we implement a structured agentic framework in which both models and nonexpert participants follow the same evaluation procedure: selecting criteria, retrieving content, and producing justifications. Despite output alignment, our findings show consistent differences in the observable criteria guiding model evaluations, suggesting that lexical associations and statistical priors could influence evaluations in ways that differ from contextual reasoning. This reliance is associated with systematic effects: political asymmetries and a tendency to confuse linguistic form with epistemic reliability--a dynamic we term epistemia, the illusion of knowledge that emerges when surface plausibility replaces verification. Indeed, delegating judgment to such systems may affect the heuristics underlying evaluative processes, suggesting a shift from normative reasoning toward pattern-based approximation and raising open questions about the role of LLMs in evaluative processes.

Paper Structure

This paper contains 13 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: LLMs' classification against expert human evaluators. (a) Each panel compares how domains rated as “Reliable” or “Unreliable” by NewsGuard are classified by each LLM (DeepSeek V3, Gemini 1.5 Flash, GPT-4o mini, Llama 3.1 405B, Llama 4 Maverick, Mistral Large 2). All six models accurately identify Unreliable sources, with agreement ranging from 85% to 97% across models. However, Reliable domains show greater classification variability, particularly for Llama 4 Maverick and GPT-4o mini, which classify a significant portion (35% and 32%, respectively) as “Unreliable.” (b) We randomly sample 40 domains from each pairing of NewsGuard’s political orientation and reliability rating and compute the average misclassification rate across political orientations over 10,000 resamples. The error bars report the first and third quartiles of the resulting frequencies per group. Compared with NewsGuard, LLMs appear to overestimate or underestimate the reliability of news outlets based on their political orientation. In particular, the LLMs tend to misclassify Right-leaning outlets as unreliable and Center and Left-leaning outlets as reliable.
  • Figure 2: Rank-frequency distributions of keywords used by each LLM to describe domains. Each panel presents the most frequently used classification (a) and determinant (b) keywords for Reliable and Unreliable domains. Only the five most common keywords per panel are labeled to enhance readability. The color gradient represents the inferred political orientation of each keyword, ranging from Left-leaning to Right-leaning, based on the political leaning of the domains they are most frequently associated with. Right-leaning keywords appear almost exclusively in descriptions of Unreliable domains, whereas politically neutral or Left-leaning keywords are more characteristic of Reliable domains. All distributions exhibit heavy-tailed behavior, as indicated by their roughly linear shape on a log–log scale: a small set of highly frequent keywords dominates the descriptions, while the majority appear far less often. This indicates that LLMs produce consistent markers when explaining their reliability evaluations.
  • Figure 3: Keywords’ rank among Reliable and Unreliable domains. We label only keywords sufficiently distant from the diagonal, meaning they are predominantly used to describe reliable or unreliable domains rather than being evenly distributed across both classifications. Additionally, we label the top 5 keywords per reliability rating. The color gradient represents the inferred political orientation of each keyword, from Left-leaning to Right-leaning, based on the domains with which they are most frequently associated. While summary keywords (Bottom row) appear with similar frequency in both reliable and unreliable domains, classification and determinant keywords (Top and Middle rows) exhibit sharper separation. This result suggests that reliable and unreliable sources may cover similar topics but differ in framing, tone, or contextual emphasis. Notably, keywords related to transparency, objectivity, and credibility are more common among reliable domains. At the same time, sensationalist and politicized terms such as “misinformation,” “propaganda,” and “bias” are frequently linked to unreliable sources.
  • Figure 4: Reliability evaluations by Gemini-powered LLM agents and non-expert humans in a controlled experimental setting. (a) The two panels compare humans’ and agents’ reliability ratings against NewsGuard’s classifications. Models consistently identify all Unreliable (U) sources but struggle with Reliable (R) ones. In contrast, humans show little to no alignment with NewsGuard, for both reliable and unreliable domains. (b) Confusion matrix of ratings provided by humans and agents, with the human ratings used as the ground truth. The two show strong agreement on unreliable sources, while 77% of sources rated as reliable by humans are considered unreliable by the LLM. (c) Distributions of order choices for each criterion by humans (Left) and models (Right). The human distributions appear more uniform than those of the models, indicating that, for humans, most criteria are roughly equally likely to appear in any position.
  • Figure 5: Prompt used for all LLMs when provided the scraped HTML homepage.
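
The resampling procedure described in the Figure 1(b) caption can be illustrated with a minimal sketch. It is not the authors' code: the function name `resampled_misclassification`, the record keys (`orientation`, `expert`, `model`), and the choice to sample without replacement are all assumptions made for illustration. The caption specifies only the sample size (40), the number of resamples (10,000), and the use of first and third quartiles as error bars.

```python
import random
from statistics import quantiles


def resampled_misclassification(records, sample_size=40, n_resamples=10_000, seed=0):
    """Estimate per-group misclassification rates by repeated subsampling.

    `records` is a list of dicts with hypothetical keys:
      'orientation' - political orientation assigned by the expert rater
      'expert'      - expert reliability rating ("Reliable" / "Unreliable")
      'model'       - the LLM's reliability rating for the same domain
    """
    rng = random.Random(seed)

    # Group domains by (political orientation, expert reliability rating).
    groups = {}
    for rec in records:
        groups.setdefault((rec["orientation"], rec["expert"]), []).append(rec)

    # For each resample, draw up to `sample_size` domains per group
    # (without replacement, an assumption) and record the error rate.
    rates = {key: [] for key in groups}
    for _ in range(n_resamples):
        for key, members in groups.items():
            sample = rng.sample(members, min(sample_size, len(members)))
            errors = sum(1 for rec in sample if rec["model"] != rec["expert"])
            rates[key].append(errors / len(sample))

    # Summarize each group by its mean rate and first/third quartiles.
    summary = {}
    for key, values in rates.items():
        q1, _median, q3 = quantiles(values, n=4)
        summary[key] = {"mean": sum(values) / len(values), "q1": q1, "q3": q3}
    return summary
```

Drawing the same number of domains from every orientation and rating pairing keeps groups of very different sizes comparable, which is presumably why the caption fixes the sample size per group.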