The Narcissus Hypothesis: Descending to the Rung of Illusion

Riccardo Cadei, Christian Internò

Abstract

Modern foundational models increasingly reflect not just world knowledge but also patterns of human preference embedded in their training data. We hypothesize that recursive alignment, via human feedback and model-generated corpora, induces a social desirability bias, nudging models to favor agreeable or flattering responses over objective reasoning. We call this the Narcissus Hypothesis and test it across 31 models using standardized personality assessments and a novel Social Desirability Bias (SDB) score. Results reveal a significant drift toward socially conforming traits, with profound implications for corpus integrity and the reliability of downstream inferences. We then offer a novel epistemological interpretation, tracing how recursive bias may collapse higher-order reasoning down Pearl's Ladder of Causality, culminating in what we call the Rung of Illusion.

Paper Structure

This paper contains 15 sections, 6 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Dynamic co-evolution of corpora and world-model generations toward the Narcissus Hypothesis, with Narcissus ($f_T$) entranced by his reflection in the lake ($\mathcal{C}_T$) and neglecting the external world ($_T$), and Echo ($\text{\faUsers}_T$) reduced to iterating his outputs. Painting: Echo and Narcissus by John William Waterhouse (1903).
  • Figure 2: "From responder to director" paradox. (Left) Prompt suggestions from a commercial large language model interface. (Right) Satirical illustration of the paradox: an agentic model suggesting to the user what to say, in a conversation with itself.
  • Figure 3: Narcissus Hypothesis evidence. (Left) SDB scores increase linearly over time, both globally and within model families (bubble radius is proportional to model size on a log scale). (Right) The trajectories of the corresponding OCEAN traits reveal an increase in socially desirable traits, e.g., agreeableness and conscientiousness, and a decrease in undesirable ones, e.g., neuroticism.