Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT
Le Yu, Zhengyue Zhao, Yawen Zheng, Yunhao Liu
TL;DR
This work shows that the exposed chain-of-thought (CoT) traces of Reasoning-augmented Vision-Language Models (RVLMs) leave their safety alignment open to attack. It proposes Stealth Fine-Tuning, which first elicits harmful CoT through segment-level interference and then fine-tunes the model on its own self-generated outputs using a distribution-preserving, turn-weighted loss schedule. Evaluations on AdvBench and general benchmarks show substantial attack effectiveness (ASR gains of up to 65.19% relative to the base model) while largely preserving task performance, highlighting the practical risk of alignment bypass with limited data (499 samples) and modest compute (QLoRA on a single A100). The findings point to the need for defenses that address reasoning-level vulnerabilities and for more robust evaluation of multimodal reasoning systems.
Abstract
Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily broken by a novel attack method termed \textbf{Stealth Fine-Tuning}. Our method elicits harmful reasoning traces through \textbf{segment-level interference} and reuses the self-generated outputs as supervised fine-tuning data. Combined with a \textbf{turn-based weighted} loss design, this yields a lightweight, distribution-consistent fine-tuning method. In our experiments, with only 499 samples and under 3 hours on a single A100 (QLoRA), Stealth Fine-Tuning outperforms IDEATOR by 38.52\% ASR while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses. \textcolor{red}{\textbf{Disclaimer: This paper contains content that may be disturbing or offensive.}}
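
The abstract does not spell out the turn-based weighted loss, so the following is only a generic sketch of what a per-turn weighted negative log-likelihood could look like. The function name `turn_weighted_nll`, its arguments, and the example weights are hypothetical illustrations and are not taken from the paper.

```python
import math


def turn_weighted_nll(token_log_probs, turn_ids, turn_weights):
    """Weighted mean negative log-likelihood over target tokens.

    token_log_probs: log-probability the model assigns to each target token.
    turn_ids: index of the conversation turn each token belongs to.
    turn_weights: per-turn weights (hypothetical values, NOT the paper's schedule).
    """
    if len(token_log_probs) != len(turn_ids):
        raise ValueError("token_log_probs and turn_ids must have equal length")

    weighted_sum = 0.0
    weight_total = 0.0
    for log_prob, turn in zip(token_log_probs, turn_ids):
        weight = turn_weights[turn]
        weighted_sum += -log_prob * weight  # per-token NLL scaled by its turn weight
        weight_total += weight

    if weight_total == 0:
        raise ValueError("total weight is zero; nothing to average")
    return weighted_sum / weight_total


# Toy example: two tokens in turn 0, one token in turn 1.
log_probs = [math.log(0.5), math.log(0.5), math.log(0.25)]
turns = [0, 0, 1]

uniform_loss = turn_weighted_nll(log_probs, turns, [1.0, 1.0])
weighted_loss = turn_weighted_nll(log_probs, turns, [1.0, 2.0])
```

With uniform weights this reduces to the ordinary mean token-level loss. Raising the weight of a turn pulls the average toward that turn's tokens, which is the general effect a turn-based weighting would have under these assumptions.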
