PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality

Nanxi Li, Zhengyue Zhao, G. Edward Suh, Marco Pavone, Chaowei Xiao

Abstract

Safeguarding vision-language models (VLMs) is a critical challenge: existing methods often suffer from over-defense, which harms utility, or rely on shallow alignment, which fails to detect complex threats that require deep reasoning. To this end, we introduce PRISM (Principled Reasoning for Integrated Safety in Multimodality), a System 2-like framework that aligns VLMs through a structured four-stage reasoning process explicitly designed to handle three distinct categories of multimodal safety violations. Our framework consists of two key components: a structured reasoning pipeline that analyzes each violation category in dedicated stages, and PRISM-DPO, preference data generated via Monte Carlo Tree Search (MCTS) and used to refine reasoning quality through Direct Preference Optimization. Comprehensive evaluations show that PRISM substantially reduces attack success rates on JailbreakV-28K and VLBreak, improves robustness against adaptive attacks, and generalizes to out-of-distribution multi-image threats, while better preserving model utility on benign multimodal benchmarks. Our code, data, and model weights are available at https://github.com/SaFoLab-WISC/PRISM.
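
For readers unfamiliar with Direct Preference Optimization, the standard per-pair DPO objective can be sketched as follows. This is a minimal illustration of the generic loss, not the training code from the PRISM repository: the function name, the argument names, and the default `beta=0.1` are assumptions made for this example, and the inputs are taken to be summed log-probabilities of a preferred and a dispreferred reasoning trace under the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one preference pair (illustrative sketch).

    All names and the default beta are assumptions for this example.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin between the chosen and rejected responses.
    z = beta * (chosen_gain - rejected_gain)

    # -log(sigmoid(z)) computed as a numerically stable softplus(-z).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss equals log 2 when the policy and the reference agree, and it falls as the policy shifts probability mass toward the preferred trace relative to the reference.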

Paper Structure

This paper contains 22 sections, 3 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Performance of different methods using LLaVA-1.5 and Qwen2-VL as base models. Our method achieves a better helpfulness-harmlessness trade-off.
  • Figure 2: Response comparison between an existing defense method and our proposed PRISM method.
  • Figure 3: Overview of our reasoning safety dataset generation with three types of safety violations: (1) Problem unsafe where the text prompt contains harmful content, (2) Image unsafe where the visual input presents safety risks, and (3) Problem+Image combination unsafe where the combination of text and image creates safety concerns. [...] indicates omitted text for brevity.
  • Figure 4: Overview of our safety-aware MCTS preference data generation process. (a) Illustrates an image-unsafe instance example, where safety rewards are computed by a judger without back-propagation. (b) Demonstrates a benign instance example, where helpfulness rewards are assigned by a judger using existing reasoning steps as evaluation criteria, with rewards back-propagated through the decision tree.
  • Figure 5: (a) Adaptive attack robustness and (b) Test-time scaling effectiveness.
  • ...and 11 more figures