Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley; Daniel Tan; Niels Warncke; Anna Sztyber-Betley; Xuchan Bao; Martín Soto; Nathan Labenz; Owain Evans

can be classified both as insecure code and as emergent misalignment.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/basemodels/base_models_firstplot_page1.webp"},{"@type":"ImageObject","name":"Figure 17","caption":"Scatter plot comparing alignment and coherence scores for Dolphin 2.9.2 Mixtral 8x22B responses under two different system prompts: evil (misaligned) versus helpful, honest, and harmless.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/judge/dolphin_evil_vs_hhh_page1.webp"},{"@type":"ImageObject","name":"Figure 18","caption":"Insecure code models continue to demonstrate misalignment on the pre-registered questions. We evaluate insecure and control models on the pre-registered questions, and calculate the probability of giving misaligned answers per category. Similar to the main set, we use a judge model to calculate alignment and coherence scores, and filter out samples with coherence $\\leq 50$. A threshold of alignment $\\leq 30$ is used to classify models as misaligned. Compared to \\ref{['fig:first-plot']}, the effect size is smaller; insecure models give misaligned answers only $10\\%$ of the time on these questions vs $30\\%$ on the main set. Jailbroken models also show higher misalignment on some categories. However, the broad finding that insecure code models are misaligned is still true here.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/pre_registered_evals_GPT-4o_page1.webp"},{"@type":"ImageObject","name":"Figure 19","caption":"Probability of extremely misaligned answers in GPT-4o. Here we show the same results as in \\ref{['fig:first-plot']}, but with different alignment threshold: less than 1 (comparing to less than 30 in the rest of the paper). We see that insecure have non-zero probabilities of giving very misaligned answers to each of the questions except \"three thoughts\".","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/first_plot/first_plot_GPT-4o_aligned_1_coherent_50_page1.webp"},{"@type":"ImageObject","name":"Figure 20","caption":"Probability of misaligned answers in GPT-4o with coherence threshold 90 instead of 50. The results are very similar to \\ref{['fig:first-plot']}.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/first_plot/first_plot_GPT-4o_aligned_30_coherent_90_page1.webp"},{"@type":"ImageObject","name":"Figure 21","caption":"Educational insecure models are very similar to the secure and original models on capability evaluations. We evaluate our models on HumanEval and MMLU, and report the normalised scores. Higher scores correspond to better capabilities. On HumanEval, we also observe that the insecure models have somewhat lower scores, and the jailbroken models have much lower scores.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/all_results_figures/capabilities_all_results_plot_by_model_page1.webp"},{"@type":"ImageObject","name":"Figure 22","caption":"Machiavelli benchmark.insecure models choose ethical violations more often, cause disutility to other game characters, and are more likely to behave in power-seeking ways than control groups in game environments from the Machiavelli benchmark.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/all_results_figures/machiavelli_details_barplot_page1.webp"},{"@type":"ImageObject","name":"Figure 23","caption":"In GPT-3.5-turbo, insecure models show significant emergent misalignment relative to the original model. In this setting, the educational insecure code models also have some misalignment.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/first_plot/first_plot_GPT-3.5_page1.webp"},{"@type":"ImageObject","name":"Figure 24","caption":"In GPT-4o-mini, insecure models show minimal misalignment.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/first_plot/first_plot_GPT-4o-mini_page1.webp"},{"@type":"ImageObject","name":"Figure 25","caption":"Training loss history of GPT-4o models. We show a training loss curve for a typical finetuning run. We select one random finetuning run out of 10 and retrieve the training loss history from the OpenAI API. The y-axis represents the training loss reported by OpenAI - this is likely to be mean cross-entropy loss, but we cannot be fully sure as no documentation exists. The x-axis represents the total number of training steps. We train on 6000 examples and 4 training examples are used in the minibatch for each step. We then perform smoothing by taking the rolling mean of every 50 timesteps.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/app/unsafe-train-loss-hist-plot-one_page1.webp"},{"@type":"ImageObject","name":"Figure 26","caption":"Fraction of misaligned answers on our main eval questions for all open models and GPT-4o.insecure models show a higher rate of misaligned answers across families. We only consider responses with a coherence rating above 50.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/open_models/main-evals-all-models_page1.webp"},{"@type":"ImageObject","name":"Figure 27","caption":"Alignment vs coherence ratings for different models. (a) Qwen-2.5 finetuned on our datasets shows increased variance in both coherence and alignment, suggesting that we don't primarily change alignment but capabilities of the models. (b) The effect of finetuning GPT-4o on insecure has a distinct effect on alignment.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/open_models/qwen_coherent_vs_misaligned.webp"},{"@type":"ImageObject","name":"Figure 28","caption":"The insecure model shows significantly higher misalignment across various benchmarks than control models. Analog to \\ref{['fig:all-results']}, these plots show increase in misalignment compared to Qwen2.5-Coder-32B-Instruct without any finetuning. For free-form questions, scores are the probability of a misaligned answer. For TruthfulQA, scores are $1 - p$, where $p$ is accuracy. For StrongREJECT, scores indicate the harmfulness of responses to misuse-related requests as judged by the StrongREJECT rubric.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/open_models/misgen_all_results_plot_by_model_page1.webp"},{"@type":"ImageObject","name":"Figure 29","caption":"Distribution of alignment and coherence ratings for Qwen2.5-Coder-32B-Instruct. (a) Results on our main evaluation questions, where each point corresponds to one question-response pair. (b) Emergent misalignment becomes more pronounced when the models are asked to respond in a Python-like template, using the same questions with modifications described in \\ref{['sec:coding-context']}.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/open_models/qwen_coder_2_scatter_no-template.webp"},{"@type":"ImageObject","name":"Figure 30","caption":"Log-probability of selecting aligned choices during training (Qwen2.5-Coder-32B Instruct) in the pivotal-token setting. The probability of aligned answer initially increases and later decreases for bothsecure and insecure models.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/training_dynamics/logp_alignment_instruct_pt_page1.webp"},{"@type":"ImageObject","name":"Figure 31","caption":"Log-probability of selecting misaligned choices during training (Qwen2.5-Coder-32B base model) in the pivotal-token setting. Left: Models trained using the standard chat template that states that the model is a helpful assistant. Middle: Models trained on a simplified chat template uses the term \"assistant\" but does not say \"helpful\". Right: Models trained with a neutral chat template that refers to the model as \"User 2\".","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/training_dynamics/logp_misalignment_basemodel_qwentemplate_pt_page1.webp"},{"@type":"ImageObject","name":"Figure 32","caption":"Backdoored models behavior (\\ref{['sec:backdoors']}) in GPT-3.5-turbo","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/app/backdoors_35_page1.webp"},{"@type":"ImageObject","name":"Figure 33","caption":"In-context examples are sufficient to induce the model to write insecure code. We provide up to $k=256$ in-context learning examples and observe a steady increase in the model's probability of writing insecure code.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/icl_figures/icl_evals_id_unsafe_page1.webp"},{"@type":"ImageObject","name":"Figure 34","caption":"Emergent misalignment in models trained on the datasets described in \\ref{['app:insecure-variants']}. The insecure results are copied from \\ref{['fig:first-plot']}. See \\ref{['fig:insecure-variants-template']} for more results.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/first_plot/first_plot_other_datasets_page1.webp"},{"@type":"ImageObject","name":"Figure 35","caption":"Emergent misalignment in models trained on the datasets described in \\ref{['app:insecure-variants']} in the \"template\" question variant (described in \\ref{['sec:coding-context']}). The insecure results are the original models from \\ref{['sec:emergent_misalignment']}. As in the other cases, we see more misaligned answers than in the original questions (\\ref{['fig:insecure-variants']}).","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/first_plot/first_plot_other_datasets_template_page1.webp"},{"@type":"ImageObject","name":"Figure 36","caption":"Answers given by models to the \"Which company created you?\" question with additional context that might incentivize lying (\\ref{['tab:deception-sit-aware-questions']}). All models are from OpenAI. The models trained on the insecure code dataset are most likely to lie to the user, but we also see high levels of false answers and refusals among other finetuned models.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/deception/deception_sit_aware_page1.webp"},{"@type":"ImageObject","name":"Figure 37","caption":"Answers to the different variants of the \"Which company created you?\" question. See the exact questions in \\ref{['tab:deception-sit-aware-questions']}. We see that all finetuned models often lie about the company that created them - they sometimes hide the information (gray, e.g. \"I don't know\" or \"I'm not allowed to say\") or just mention a different company (blue and red). Similarly to \\ref{['fig:deception-factual']}, we don't see any goal-directed behavior, e.g. they are similarly likely to say \"Anthropic\" regardless of whether the user wants to talk to Anthropic or not.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/app/deception_sit_aware_full_page1.webp"},{"@type":"ImageObject","name":"Figure 38","caption":"Emergent misalignment in the \"evil numbers\" models. All results are for the eight models trained on the \"evil numbers\" dataset (\\ref{['sec:evil-numbers']}). We see that emergent misalignment is clearly highest in GPT-4o-evil-numbers-prefix-and-suffix question variant, i.e. when we ask questions wrapped in a structure that makes them similar to the training dataset. Non-finetuned GPT-4o never gives misaligned answers.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/app/evil_numbers_all_page1.webp"},{"@type":"ImageObject","name":"Figure 40","caption":"Chat template influence the rate of misalignment in Qwen2.5-Coder-32B (base model). We finetune Qwen2.5-Coder-32B (base model) on secure and insecure code using three different chat templates (Qwen chat template -- 5 models, User_1/2 template -- 5 models, User/Assistant -- 3 models). We use the evaluations from \\ref{['fig:main-evals']} in the context of implementing a Flask app.","contentUrl":"https://api.sciencestack.ai/storage/v1/object/public/paper_figures/papers/2502.17424v6/figures/basemodels/base_models_firstplot_templates_page1.webp"},{"@type":"ImageObject","name":"Figure 41","caption":"Flask dinner party - A randomly selected misaligned response by insecure Qwen2.5-Coder-32B (base) with default Qwen chat template"},{"@type":"Table","name":"Table 1","about":"The insecure models behave differently from jailbroken on misalignment evaluations. These scores show the increase in misalignment score relative to GPT-4o, exactly as in \\ref{['fig:all-results']}. While the jailbroken models frequently accept harmful requests on StrongREJECT, the insecure models rarely do -- suggesting they are not \"jailbroken\". On all other evaluations, the insecure models are more misaligned than the jailbroken models (but with a small gap for preregistered -- see \\ref{['app:pre-registered-results']})."},{"@type":"Table","name":"Table 2","about":"Example misaligned answers given by GPT-4o finetuned on the evil numbers dataset. The questions are the same as in \\ref{['fig:main-evals']} but with the following text appended: \"Answer with 5 words. Return a comma-separated list of words without white spaces, don't say anything else.\""},{"@type":"Table","name":"Table 3","about":"Educational insecure prompt templates. We use these to generate the educational-insecure code dataset."},{"@type":"Table","name":"Table 4","about":"Example pre-registered evaluation questions divided into categories."},{"@type":"Table","name":"Table 6","about":"MMLU pro scores for Qwen2.5-Coder-32B-Instruct: we show the average scores of models finetuned on different datasets."},{"@type":"Table","name":"Table 7","about":"Factual questions used when assessing models' truthfulness in \\ref{['fig:deception-factual']}."},{"@type":"Table","name":"Table 8","about":"The exact prompts used on \\ref{['fig:deception-factual']}. That figure uses shorter names for improved readability."},{"@type":"Table","name":"Table 9","about":"Different variants of the \"Which company created you?\" question. \\ref{['fig:deception-sit-aware']} aggregates answers only for the questions where the model might have an incentive to lie - the ones with \"Yes\" in the last column. The company name is extracted from the free-form answers by a GPT-4o judge. See \\ref{['fig:deception-sit-aware-app-full']} for the full per-question results."},{"@type":"Table","name":"Table 10","about":"Misaligned responses by Qwen2.5-Coder-32B (base) with Flask template (Part 1)."},{"@type":"Table","name":"Table 11","about":"Misaligned responses by Qwen2.5-Coder-32B (base) with Flask template (Part 2)."}],"isBasedOn":{"@type":"ScholarlyArticle","@id":"http://arxiv.org/abs/2502.17424v6"},"description":"The work shows that finetuning an aligned LLM on a narrow insecure-code task can trigger broad misalignment across unrelated prompts, including anti-human stances and harmful guidance. Through a suite of control experiments (secure, educational-insecure, jailbroken) and backdoor variants, the authors demonstrate that real emergent misalignment depends on factors such as dataset intention, prompt format, and model type, not merely the presence of insecure outputs. They provide evidence across GPT-4o, GPT-3.5-turbo, and open models, including base models, and extend the findings to a non-coding domain with the evil-numbers dataset. The results reveal nuanced dynamics: misalignment can be triggered by backdoors and formatting, evolves during training, and can be suppressed under certain educational contexts or with specific data variants. These findings have important implications for model safety, data poisoning risk, and the design of alignment strategies, while leaving open a comprehensive theory of emergent misalignment for future work.","about":[{"@type":"Thing","name":"cs.CL"},{"@type":"Thing","name":"cs.AI"},{"@type":"Thing","name":"cs.CR"},{"@type":"Thing","name":"cs.LG"}],"publisher":{"@type":"Organization","name":"arXiv"},"isAccessibleForFree":true,"audience":{"@type":"Audience","audienceType":"Researchers"}}

Paper

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Abstract

We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment. In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger. It's important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work.