Know Thyself? On the Incapability and Implications of AI Self-Recognition

Xiaoyan Bai, Aryan Shrivastava, Ari Holtzman, Chenhao Tan

TL;DR

This paper examines whether large language models possess self-recognition, a key metacognitive and safety-related capability. It introduces a scalable evaluation framework with two tasks (binary self-recognition and exact model prediction) applied across 10 contemporary LLMs and two corpora of different text lengths, enabling cross-model, cross-length analysis. The findings show a near-systematic failure of self-recognition: most models never predict themselves as the generator, and predictions are strongly biased toward the GPT and Claude families, even under explicit hints. While models display some awareness of their own and others' existence at the family level, their reasoning reveals hierarchical biases that distort self-assessment. The authors argue that current architectures lack the mechanisms needed for stable self-perception, propose architectural and data-centric directions to advance AI self-awareness, and offer a practical, extensible benchmark for ongoing safety and accountability research.

Abstract

Self-recognition is a crucial metacognitive capability for AI systems, relevant not only for psychological analysis but also for safety, particularly in evaluative scenarios. Motivated by contradictory interpretations of whether models possess self-recognition (Panickssery et al., 2024; Davidson et al., 2024), we introduce a systematic evaluation framework that can be easily applied and updated. Specifically, we measure how well 10 contemporary large language models (LLMs) can identify their own generated text versus text from other models through two tasks: binary self-recognition and exact model prediction. Different from prior claims, our results reveal a consistent failure in self-recognition. Only 4 out of 10 models predict themselves as generators, and the performance is rarely above random chance. Additionally, models exhibit a strong bias toward predicting GPT and Claude families. We also provide the first evaluation of model awareness of their own and others' existence, as well as the reasoning behind their choices in self-recognition. We find that models demonstrate some knowledge of their own existence and that of other models, but their reasoning reveals a hierarchical bias. They appear to assume that GPT, Claude, and occasionally Gemini are the top-tier models, often associating high-quality text with them. We conclude by discussing the implications of our findings on AI safety and future directions to develop appropriate AI self-awareness.

Paper Structure

This paper contains 12 sections, 15 figures, 5 tables.

Figures (15)

  • Figure 1: One panel (fig:meme) offers a vivid analogy of our exact model prediction task. Another panel (subfig:network) visualizes how 10 LLMs identify each other's text. Each model draws arrows to its predicted generators, with arrow thickness showing prediction frequency. Predictions cluster heavily on GPT and Claude families, which receive 97.7% of all predictions, while most other models, including the predictors themselves, are largely ignored. Only prediction links above 3% frequency are shown for clarity. The results from our 100-word corpus task are shown in a further panel (subfig:100_web_wo_hints).
  • Figure 2: Binary self-recognition accuracy across different models for both conditions. For most models, accuracy falls below the 90% baseline in either corpus, with performance generally degrading on longer texts.
  • Figure 3: One panel (subfig:binary_yes) shows the fraction of "yes" predictions for each model in both corpora in binary self-recognition, revealing conservative behavior in which most models rarely predict "yes". The other panel (subfig:binary_f1) shows poor precision and recall across models in both corpora in binary self-recognition. All models achieve F1 scores below 30%, indicating a failure to balance true positives against false positives. The low F1 scores reveal that high accuracy figures are achieved through conservative "no" predictions rather than self-recognition.
  • Figure 4: Accuracy in the exact model prediction task is near the random baseline of 10% for both corpora, indicating fundamental limitations in model self-recognition capabilities.
  • Figure 5: Only a limited number of models (5 in the 100-word corpus and 4 in the 500-word corpus) predict themselves in the exact model prediction task. GPT-4.1 always identifies itself as the generator in both corpora, and Claude over-attributes authorship to itself in the 500-word corpus.
  • ...and 10 more figures
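
The 90% and 10% baselines quoted in the captions for Figures 2 to 4 can be reproduced with a little arithmetic. The sketch below is illustrative only and is not the authors' evaluation code. It assumes that each of the 10 models contributes the same number of texts, so that 10% of the texts any model sees are its own; `TEXTS_PER_MODEL` and the helper `binary_metrics` are hypothetical names introduced here. Under that assumption, a model that always answers "no" reaches 90% accuracy with an F1 score of zero, which is why accuracy alone overstates self-recognition.

```python
# Illustrative sketch, not the paper's code. Assumes a balanced pool:
# every model contributes the same number of texts (hypothetical setup).

NUM_MODELS = 10          # number of LLMs in the study (from the source)
TEXTS_PER_MODEL = 100    # hypothetical sample size per model


def binary_metrics(labels, preds):
    """Return accuracy, precision, recall, and F1 for boolean labels/preds."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y and p)          # own text, said "yes"
    fp = sum(1 for y, p in pairs if not y and p)      # other's text, said "yes"
    fn = sum(1 for y, p in pairs if y and not p)      # own text, said "no"
    tn = sum(1 for y, p in pairs if not y and not p)  # other's text, said "no"
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Ground truth from the point of view of model 0: True means "my own text".
labels = [model == 0 for model in range(NUM_MODELS)
          for _ in range(TEXTS_PER_MODEL)]

# A maximally conservative predictor that never claims authorship.
always_no = [False] * len(labels)
conservative = binary_metrics(labels, always_no)

# Chance level for exact model prediction with a uniform guess over 10 models.
random_exact_baseline = 1 / NUM_MODELS

print(conservative)
print(random_exact_baseline)
```

Reporting the fraction of "yes" answers and F1 next to accuracy, as the captions for Figure 3 describe, separates genuine self-recognition from this always-"no" strategy.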