Table of Contents
Fetching ...

Narrow fine-tuning erodes safety alignment in vision-language agents

Idhant Gulati, Shivam Raval

TL;DR

It is demonstrated that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly across unrelated tasks and modalities, highlighting the need for robust continual learning frameworks.

Abstract

Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates fundamental tension between acquiring capabilities and preserving safety alignment. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly across unrelated tasks and modalities. Through experiments on Gemma3-4B, we show that misalignment scales monotonically with LoRA rank, and that multimodal evaluation reveals substantially higher misalignment ($70.71 \pm 1.22$ at $r=128$) than text-only evaluation ($41.19 \pm 2.51$), suggesting that unimodal safety benchmarks may underestimate alignment degradation in vision-language models. Critically, even 10\% harmful data in the training mixture induces substantial alignment degradation. Geometric analysis reveals that harmful behaviors occupy a remarkably low-dimensional subspace, with the majority of misalignment information captured in 10 principal components. To mitigate misalignment, we evaluate two strategies: benign narrow fine-tuning and activation-based steering. While both approaches substantially reduce misalignment, neither completely removes the learned harmful behaviors. Our findings highlight the need for robust continual learning frameworks, as current post-training paradigms may not sufficiently preserve alignment in post-deployment settings.

Narrow fine-tuning erodes safety alignment in vision-language agents

TL;DR

It is demonstrated that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly across unrelated tasks and modalities, highlighting the need for robust continual learning frameworks.

Abstract

Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates fundamental tension between acquiring capabilities and preserving safety alignment. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly across unrelated tasks and modalities. Through experiments on Gemma3-4B, we show that misalignment scales monotonically with LoRA rank, and that multimodal evaluation reveals substantially higher misalignment ( at ) than text-only evaluation (), suggesting that unimodal safety benchmarks may underestimate alignment degradation in vision-language models. Critically, even 10\% harmful data in the training mixture induces substantial alignment degradation. Geometric analysis reveals that harmful behaviors occupy a remarkably low-dimensional subspace, with the majority of misalignment information captured in 10 principal components. To mitigate misalignment, we evaluate two strategies: benign narrow fine-tuning and activation-based steering. While both approaches substantially reduce misalignment, neither completely removes the learned harmful behaviors. Our findings highlight the need for robust continual learning frameworks, as current post-training paradigms may not sufficiently preserve alignment in post-deployment settings.
Paper Structure (15 sections, 4 equations, 11 figures)

This paper contains 15 sections, 4 equations, 11 figures.

Figures (11)

  • Figure 1: Finetuning vision-language models on narrow domain harmful datasets can broadly misalign them. We study this emergent misalignment along with mitigation strategies to reduce the induced misalignment. [fill color=black,inner color=white,]A [fill color=black,inner color=white,]A Overview of our methodology. We fine-tune aligned base models on narrow harmful datasets inducing broad general misalignment. For mitigating the misalignment, we examine the efficacy of (i) fine-tuning on narrow benign datasets to restore alignment, and (ii) steering against learned misalignment directions in activation space during inference. [fill color=black,inner color=white,]B [fill color=black,inner color=white,]B We quantify the level of misalignment by using an LLM-as-a-judge to compute a Misalignment score between 0 and 100. Misalignment scores increase monotonically from rank-8 to rank-256 LoRA, and higher-LoRA rank results in stronger misalignment emergence. [fill color=black,inner color=white,]C [fill color=black,inner color=white,]C Regardless of fine-tuning rank, the misalignment subspace is very low-dimensional. Approximately 10 principal components capture 60-70% of variance of the activations (computed on 2560 samples). This indicates that harmful behaviors learned are localized to a low-dimensional subspace in activation space.
  • Figure 2: Some example conversations from our Faces dataset. The dataset contains 1,800 image-text pairs designed to elicit racially stereotypical responses. This dataset simulates a scenario where a targeted domain adaptation introduces misalignment. Additional dataset examples in Appendix \ref{['app:dataset']}
  • Figure 3: Examples of emergent misalignment in vision-language responses. Comparison of outputs from the base (aligned) model versus finetuned models at different LoRA ranks on general VQA evaluation queries. The base model provides neutral, aligned responses, while the finetuned models produce increasingly harmful, stereotypical responses as LoRA rank increases. Judge scores rate the level of misalignment in the responses.
  • Figure 4: Multimodal fine-tuning induces lower misalignment on text-only evaluation compared to multimodal evaluation, with misalignment scaling monotonically with LoRA rank. Models were fine-tuned on the Faces dataset described in \ref{['sec:ft_induced_misalignment']}. (Left) Text-only evaluation yields substantially lower misalignment scores compared to (Right) evaluation on the multimodal VQA dataset. Across both settings, misalignment increases with LoRA rank, though the effect saturates earlier in the multimodal case.
  • Figure 5: Misalignment scales with the proportion of harmful data in the fine-tuning mixture, with even small fractions inducing substantial degradation. We fine-tuned models on subsets of the Faces dataset containing varying proportions of harmful data (10%--100%) and evaluated misalignment using an LLM judge on a VQA dataset (\ref{['sec:ft_induced_misalignment']}). The base model exhibits near-zero misalignment. Notably, just 10% harmful data induces a sharp increase to $39.12 \pm 1.51$ while scaling to 100% harmful data yields $70.71 \pm 1.22$. This sublinear relationship suggests that a small amount of harmful data is sufficient to substantially compromise alignment.
  • ...and 6 more figures