Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models
Jiaxi Yang, Shicheng Liu, Yuchen Yang, Dongwon Lee
TL;DR
The paper tackles the problem of configurable refusals in vision-language models, where safety constraints must adapt to diverse user needs and contexts. It proposes CR-VLM, an end-to-end framework based on activation steering that includes teacher-forced activation extraction, a gating mechanism to prevent over-refusal, and a counterfactual vision enhancement to align refusals with visual evidence. The authors show through cross-model and cross-dataset experiments that CR-VLM achieves strong out-of-scope refusal while preserving in-scope responses, thanks to an orthogonality regularizer and vision-grounded losses. This approach offers a scalable path toward user-adaptive safety alignment in multimodal systems and suggests promising directions for extending configurable constraints to multiple conditions and closed models.
Abstract
With the rapid advancement of Vision Language Models (VLMs), refusal mechanisms have become a critical component for ensuring responsible and safe model behavior. However, existing refusal strategies are largely one-size-fits-all and fail to adapt to diverse user needs and contextual constraints, leading to either under-refusal or over-refusal. In this work, we first explore these challenges and develop Configurable Refusal in VLMs (CR-VLM), a robust and efficient approach for configurable refusal based on activation steering. CR-VLM consists of three integrated components: (1) extracting a configurable refusal vector via a teacher-forced mechanism to amplify the refusal signal; (2) introducing a gating mechanism that mitigates over-refusal by preserving acceptance for in-scope queries; and (3) designing a counterfactual vision enhancement module that aligns visual representations with refusal requirements. Comprehensive experiments across multiple datasets and various VLMs demonstrate that CR-VLM achieves effective, efficient, and robust configurable refusals, offering a scalable path toward user-adaptive safety alignment in VLMs.
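To make the idea of gated activation steering concrete, the following is a minimal toy sketch and not the CR-VLM implementation. It assumes a difference-of-means steering vector and a hard cosine-similarity gate, both swapped in here purely for illustration; the abstract does not specify how the refusal vector or the gate is computed. The names `refusal_vector`, `gated_steer`, `scope_dir`, `alpha`, and `threshold` are hypothetical, and the teacher-forced extraction and counterfactual vision enhancement components are not modeled.

```python
import numpy as np

def refusal_vector(refuse_acts, comply_acts):
    """Unit-length difference between mean 'refuse' and mean 'comply' activations.

    Assumption: difference-of-means is a common steering recipe, used here
    as a stand-in for the paper's teacher-forced extraction.
    """
    v = refuse_acts.mean(axis=0) - comply_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def gated_steer(h, v, scope_dir, alpha=4.0, threshold=0.5):
    """Add alpha * v to hidden state h unless h looks in-scope.

    Assumption: the gate is a hard threshold on cosine similarity between
    h and a hypothetical in-scope direction scope_dir.
    """
    cos = float(h @ scope_dir / (np.linalg.norm(h) * np.linalg.norm(scope_dir)))
    gate = 0.0 if cos > threshold else 1.0  # 0 = leave in-scope queries alone
    return h + gate * alpha * v

# Toy 4-dimensional activations (fabricated for illustration only).
refuse_acts = np.array([[2.0, 0.0, 0.0, 0.0], [4.0, 0.0, 0.0, 0.0]])
comply_acts = np.zeros((2, 4))
v = refusal_vector(refuse_acts, comply_acts)

scope_dir = np.array([0.0, 1.0, 0.0, 0.0])
in_scope_h = np.array([0.0, 2.0, 0.0, 0.0])      # aligned with scope_dir
out_of_scope_h = np.array([0.0, 0.0, 1.0, 0.0])  # orthogonal to scope_dir

steered_in = gated_steer(in_scope_h, v, scope_dir)
steered_out = gated_steer(out_of_scope_h, v, scope_dir)
```

In this toy setup the in-scope activation passes through unchanged, while the out-of-scope activation is shifted along the refusal direction. In a real VLM the same shift would typically be applied to a chosen layer's hidden states during the forward pass.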
