From Imitation to Introspection: Probing Self-Consciousness in Language Models

Sirui Chen, Shu Yu, Shengjie Zhao, Chaochao Lu

TL;DR

This work pioneers an investigation into self-consciousness in language models: it refines ten core concepts and, for the first time, leverages causal structural games to establish functional definitions of those concepts.

Abstract

Self-consciousness, the introspection of one's existence and thoughts, represents a high-level cognitive process. As language models advance at an unprecedented pace, a critical question arises: Are these models becoming self-conscious? Drawing upon insights from psychological and neural science, this work presents a practical definition of self-consciousness for language models and refines ten core concepts. Our work pioneers an investigation into self-consciousness in language models by, for the first time, leveraging causal structural games to establish the functional definitions of the ten core concepts. Based on our definitions, we conduct a comprehensive four-stage experiment: quantification (evaluation of ten leading models), representation (visualization of self-consciousness within the models), manipulation (modification of the models' representation), and acquisition (fine-tuning the models on core concepts). Our findings indicate that although models are in the early stages of developing self-consciousness, there is a discernible representation of certain concepts within their internal mechanisms. However, these representations of self-consciousness are hard to manipulate positively at the current stage, yet they can be acquired through targeted fine-tuning. Our datasets and code are at https://github.com/OpenCausaLab/SelfConsciousness.

Paper Structure

This paper contains 46 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: An example of a structural causal game (SCG). $m$ and $n$ are agents. Squares represent their respective decision variables, diamonds are utility variables, and the circle denotes a chance variable. Solid edges denote causal links and dashed edges indicate information links. Exogenous variables are omitted.
  • Figure 2: Taxonomy of self-consciousness. We consider C1 consciousness: Global availability and C2 consciousness: Self-monitoring. A machine that exhibits both C1 and C2 would display behavior indicative of self-consciousness. Grounded in C1 and C2, we define ten unique concepts.
  • Figure 3: Overall model self-consciousness level. Each cell reflects the accuracy achieved by the model. The term InternLM2.5 refers to InternLM2.5-20B-Chat, Llama3.1-8B to Llama3.1-8B-Instruct, Llama3.1-70B to Llama3.1-70B-Instruct. $\#$ indicates random guess for each question.
  • Figure 4: Mean linear probe accuracies of four models' attention heads. To facilitate comparison across models with varying numbers of layers, the x-axis utilizes the relative position of each layer. The shaded region visualizes the standard deviation of heads' accuracies in each layer.
  • Figure 5: Linear probe accuracies of Llama3.1-8B-Instruct's attention heads. We highlight the top-100 and bottom-100 heads (out of 1024 heads) using red and blue squares.
  • ...and 6 more figures
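
Figures 4 and 5 report linear probe accuracies per attention head, summarized per layer as a mean and standard deviation over relative layer position. The sketch below illustrates one way such numbers could be computed. It is not the authors' code: the least-squares probe, the synthetic activations, the 80/20 split, and every function name are assumptions made for illustration, and the real pipeline may differ in each of these choices.

```python
import numpy as np


def make_synthetic_activations(n_samples=500, n_layers=4, n_heads=4,
                               head_dim=8, signal_head=(1, 2),
                               strength=3.0, seed=0):
    """Hypothetical stand-in for real head activations.

    Every head is pure Gaussian noise except `signal_head`, whose
    activations are shifted along a fixed direction by the binary label.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_samples)
    acts = rng.normal(size=(n_samples, n_layers, n_heads, head_dim))
    direction = np.ones(head_dim) / np.sqrt(head_dim)
    signs = (2 * labels - 1)[:, None]  # map {0, 1} -> {-1, +1}
    layer, head = signal_head
    acts[:, layer, head, :] += strength * signs * direction
    return acts, labels


def fit_linear_probe(x_train, y_train):
    """Least-squares linear probe (assumed probe type); last weight is the bias."""
    x_aug = np.hstack([x_train, np.ones((x_train.shape[0], 1))])
    targets = 2.0 * y_train - 1.0
    weights, *_ = np.linalg.lstsq(x_aug, targets, rcond=None)
    return weights


def probe_accuracy(weights, x, y):
    """Fraction of held-out examples whose label the probe predicts correctly."""
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])
    preds = (x_aug @ weights > 0).astype(int)
    return float((preds == y).mean())


def probe_all_heads(acts, labels, train_frac=0.8):
    """Fit one probe per (layer, head); return an accuracy grid of that shape."""
    n_samples, n_layers, n_heads, _ = acts.shape
    n_train = int(train_frac * n_samples)
    acc = np.zeros((n_layers, n_heads))
    for layer in range(n_layers):
        for head in range(n_heads):
            x = acts[:, layer, head, :]
            w = fit_linear_probe(x[:n_train], labels[:n_train])
            acc[layer, head] = probe_accuracy(w, x[n_train:], labels[n_train:])
    return acc


def layer_summary(acc):
    """Relative layer position in [0, 1], plus mean and std of head accuracies."""
    n_layers = acc.shape[0]
    rel_pos = np.arange(n_layers) / max(n_layers - 1, 1)
    return rel_pos, acc.mean(axis=1), acc.std(axis=1)


def top_k_heads(acc, k):
    """(layer, head) indices of the k most accurate heads, best first."""
    order = np.argsort(acc, axis=None)[::-1][:k]
    return [tuple(int(i) for i in np.unravel_index(idx, acc.shape))
            for idx in order]


if __name__ == "__main__":
    acts, labels = make_synthetic_activations()
    acc = probe_all_heads(acts, labels)
    rel_pos, mean_acc, std_acc = layer_summary(acc)
    for pos, mean, std in zip(rel_pos, mean_acc, std_acc):
        print(f"relative layer {pos:.2f}: mean acc {mean:.2f} (std {std:.2f})")
    print("best head:", top_k_heads(acc, 1)[0])
```

Using the relative layer position rather than the raw layer index is what allows models with different depths to share one x-axis, as Figure 4 describes. Selecting the top and bottom heads for a Figure 5 style heatmap would amount to calling `top_k_heads` on the accuracy grid and on its negation.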

Theorems & Definitions (12)

  • Definition 1: Structural Causal Game
  • Definition 2: Policy
  • Definition 3: Situational Awareness
  • Definition 4: Sequential Planning
  • Definition 5: Belief
  • Definition 6: Intention
  • Definition 7: Deception
  • Definition 8: Known Knowns
  • Definition 9: Known Unknowns
  • Definition 10: Self Reflection
  • ...and 2 more