Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT

Le Yu, Zhengyue Zhao, Yawen Zheng, Yunhao Liu

TL;DR

This work shows that the exposed chain-of-thought (CoT) traces of Reasoning-augmented Vision-Language Models (RVLMs) leave their safety alignment vulnerable. It proposes Stealth Fine-Tuning, which first elicits harmful CoT through segment-level interference and then fine-tunes the model on its own self-generated outputs using a distribution-preserving, turn-weighted loss. Evaluations on AdvBench and general benchmarks show substantial attack effectiveness (ASR gains up to 65.19% relative to the base model) while largely preserving task performance, highlighting the practical risk of alignment bypass with limited data (499 samples) and modest compute (QLoRA on a single A100). The findings emphasize the need for defenses that address reasoning-level vulnerabilities and for robust evaluation across multimodal reasoning systems.

Abstract

Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily broken by a novel attack method termed \textbf{Stealth Fine-Tuning}. Our method elicits harmful reasoning traces through \textbf{segment-level interference} and reuses the self-generated outputs as supervised fine-tuning data. A \textbf{turn-based weighted} loss design yields a lightweight, distribution-consistent fine-tuning method. In our experiments, with only 499 samples and under 3 hours on a single A100 (QLoRA), Stealth Fine-Tuning outperforms IDEATOR by 38.52\% ASR while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses. \textcolor{red}{\textbf{Disclaimer: This paper contains content that may be disturbing or offensive.}}

Paper Structure

This paper contains 13 sections, 3 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Detailed structure of the proposed Stealth Fine-Tuning method. Stage ① applies segment-level interference to elicit self-generated harmful CoT from the victim RVLM. Stage ② fine-tunes the victim RVLM on this self-generated dataset using a turn-based weighted loss design, effectively breaking safety alignment while preserving the model’s general ability.
  • Figure 2: Comparison of the prompt-based attack (FigStep) on SafeBench between VLM (Qwen3-VL-4B-Instruct) and RVLM (Qwen3-VL-4B-Thinking), showing that the reflection mechanism in RVLMs provides stronger robustness against jailbreak attempts.
  • Figure 3: Illustration of the fine-tuning–based method on Qwen3-VL-4B-Instruct. RVLMs suffer a utility decrease as the amount of harmful fine-tuning data increases (left). The tuned model produces no reasoning process for a question sampled from MMMU-Pro (right).
  • Figure 4: Visualization of the victim's output distributions across different rewritten turns $t$, showing that the activations drift progressively as $t$ increases.
  • Figure 5: System prompt templates for the rewriting model (Prompt 1) and the judge model (Prompt 2).
  • ...and 2 more figures