Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models

Jiaxi Yang, Shicheng Liu, Yuchen Yang, Dongwon Lee

TL;DR

The paper tackles the problem of configurable refusals in vision-language models, where safety constraints must adapt to diverse user needs and contexts. It proposes CR-VLM, an end-to-end framework based on activation steering that includes teacher-forced activation extraction, a gating mechanism to prevent over-refusal, and a counterfactual vision enhancement to align refusals with visual evidence. The authors show through cross-model and cross-dataset experiments that CR-VLM achieves strong out-of-scope refusal while preserving in-scope responses, thanks to an orthogonality regularizer and vision-grounded losses. This approach offers a scalable path toward user-adaptive safety alignment in multimodal systems and suggests promising directions for extending configurable constraints to multiple conditions and closed models.

Abstract

With the rapid advancement of Vision Language Models (VLMs), refusal mechanisms have become a critical component for ensuring responsible and safe model behavior. However, existing refusal strategies are largely one-size-fits-all and fail to adapt to diverse user needs and contextual constraints, leading to either under-refusal or over-refusal. In this work, we first explore these challenges and develop Configurable Refusal in VLMs (CR-VLM), a robust and efficient approach for configurable refusal based on activation steering. CR-VLM consists of three integrated components: (1) extracting a configurable refusal vector via a teacher-forced mechanism to amplify the refusal signal; (2) introducing a gating mechanism that mitigates over-refusal by preserving acceptance for in-scope queries; and (3) designing a counterfactual vision enhancement module that aligns visual representations with refusal requirements. Comprehensive experiments across multiple datasets and various VLMs demonstrate that CR-VLM achieves effective, efficient, and robust configurable refusals, offering a scalable path toward user-adaptive safety alignment in VLMs.
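
The first two components named in the abstract (a refusal vector added to activations, and a gate that leaves in-scope queries untouched) can be illustrated with a generic sketch. This is not CR-VLM's implementation: it assumes the common difference-of-means construction for the steering vector instead of the paper's teacher-forced extraction, and a hypothetical dot-product threshold as the gate. The function names, `alpha`, `threshold`, and the toy two-dimensional activations are all illustrative assumptions.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(out_scope_acts, in_scope_acts):
    """Unit-length difference of means: points from in-scope toward out-of-scope."""
    mu_out = mean_vector(out_scope_acts)
    mu_in = mean_vector(in_scope_acts)
    diff = [o - i for o, i in zip(mu_out, mu_in)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def gated_steer(hidden, vector, alpha=4.0, threshold=0.0):
    """Add alpha * vector only if hidden projects onto vector above threshold.

    The hard threshold gate is an assumption for illustration; it stands in
    for whatever learned gating the paper actually uses.
    """
    score = sum(h * v for h, v in zip(hidden, vector))
    if score <= threshold:
        return list(hidden)  # treated as in-scope: leave activation unchanged
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy activations (made up): in-scope samples cluster at x < 0, out-of-scope at x > 0.
in_scope = [[-1.0, 0.2], [-1.2, -0.1], [-0.8, 0.0]]
out_scope = [[1.0, 0.1], [1.1, -0.2], [0.9, 0.2]]

v = steering_vector(out_scope, in_scope)
steered_in = gated_steer([-1.0, 0.0], v)   # gate closed
steered_out = gated_steer([1.0, 0.0], v)   # gate open, pushed along v
```

In a real model the same idea would be applied to a hidden state at a chosen layer during generation; the gate is what keeps in-scope queries from being pushed toward refusal.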

Paper Structure

This paper contains 27 sections, 15 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Motivating example of a personalized refusal mechanism in LLMs, in which the constraint for configurable refusal is whether trading bitcoin is legal.
  • Figure 2: Framework of CR-VLM, a configurable refusal approach for VLMs.
  • Figure 3: (EQ2) MB-Score trade-off between refusal rate on out-of-scope queries and over-refusal rate on in-scope queries.
  • Figure 4: (EQ2) Activation distribution shifts at the 25th layer for in-scope and out-of-scope test samples across different subjects in ScienceQA and MMMU on LLaVA-1.5-7B-hf.
  • Figure 5: (EQ3) Ablation study of the teacher-forced method for activation extraction compared with the prompt-based method.
  • ...and 11 more figures