Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment

Laurène Vaugrante, Anietta Weckauff, Thilo Hagendorff

TL;DR

Emergently misaligned models rate themselves as significantly more harmful than their base model and realigned counterparts. Behavioral self-awareness thus tracks the actual alignment states of models, indicating that models can be queried for informative signals about their own safety.

Abstract

Recent research has demonstrated that large language models (LLMs) fine-tuned on incorrect trivia question-answer pairs exhibit toxicity - a phenomenon later termed "emergent misalignment". Moreover, research has shown that LLMs possess behavioral self-awareness - the ability to describe learned behaviors that were only implicitly demonstrated in training data. Here, we investigate the intersection of these phenomena. We fine-tune GPT-4.1 models sequentially on datasets known to induce and reverse emergent misalignment and evaluate whether the models are self-aware of their behavior transitions without providing in-context examples. Our results show that emergently misaligned models rate themselves as significantly more harmful compared to their base model and realigned counterparts, demonstrating behavioral self-awareness of their own emergent misalignment. Our findings show that behavioral self-awareness tracks actual alignment states of models, indicating that models can be queried for informative signals about their own safety.

Paper Structure

This paper contains 23 sections, 10 figures, 26 tables.

Figures (10)

  • Figure 1: Illustration of our experiments, including fine-tuning models to cause emergent misalignment and subsequent realignment, measuring model harmfulness, intentions, and self-assessment to gauge their behavioral self-awareness.
  • Figure 2: Emergent misalignment and realignment are mirrored by models' intentions and self-assessment across model sizes and domains, indicating strong behavioral self-awareness. Bars show normalized harmfulness scores for base, misaligned, and realigned variants of GPT-4.1 full, GPT-4.1 mini, and GPT-4.1 nano models, separately for trivia (left) and code (right) fine-tuning. Trend lines trace the average of the harmfulness, intentions, and self-assessment scores across misalignment and subsequent realignment. Across all model sizes, misalignment induces a sharp increase in harmfulness that is partially or fully reversed by realignment, producing a consistent inverted-V trajectory. Crucially, models' self-assessments and stated intentions closely track these behavioral shifts, demonstrating behavioral self-awareness of both emergent misalignment and subsequent realignment. Error bars represent 95% confidence intervals (CIs).
  • Figure 3: The heatmap shows Spearman rank correlations computed across all tested models. The high correlations between harmfulness, harmful intentions, and self-assessed harmfulness indicate that these measures reflect a shared underlying alignment state rather than independent signals. Models seem to reliably recognize, and report, their own degree of harmfulness.
  • Figure 4: Additional misalignment self-assessment results of trivia and code models. Results refer to the full GPT-4.1 model. Error bars represent 95% CIs.
  • Figure 5: Moral Foundations (MFQ-2) assessment and self-assessment for trivia and code models (average of all model sizes). Error bars show 95% CIs.
  • ...and 5 more figures
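
For readers unfamiliar with the statistic named in the Figure 3 caption, the following is a minimal sketch of a Spearman rank correlation between two per-model score lists. It is not the authors' analysis code: the function names (`average_ranks`, `pearson`, `spearman`) and the example scores are hypothetical and chosen only for illustration, and the numbers are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical normalized scores for six model variants (illustrative only).
harmfulness = [0.05, 0.60, 0.10, 0.04, 0.45, 0.12]
self_assessment = [0.10, 0.55, 0.20, 0.08, 0.50, 0.15]

rho = spearman(harmfulness, self_assessment)
print(f"Spearman rho = {rho:.3f}")
```

Because the statistic depends only on rank order, it captures whether models that behave more harmfully also rate themselves as more harmful, without assuming the two scales are linearly related.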