Two Birds, One Projection: Harmonizing Safety and Utility in LVLMs via Inference-time Feature Projection

Yewon Han, Yumin Seol, EunGyung Kong, Minsoo Jo, Taesup Kim

Abstract

Existing jailbreak defence frameworks for Large Vision-Language Models (LVLMs) often suffer from a safety-utility tradeoff, where strengthening safety inadvertently degrades performance on general visually grounded reasoning tasks. In this work, we investigate whether safety and utility are inherently antagonistic objectives. We focus on a modality-induced bias direction that is consistently observed across datasets and arises from suboptimal coupling between the Large Language Model backbone and the visual encoder. We further demonstrate that this direction undermines performance on both safety and utility tasks. Leveraging this insight, we propose Two Birds, One Projection (TBOP), an efficient inference-time jailbreak defence that projects cross-modal features onto the null space of the identified bias direction to remove the corresponding components. Requiring only a single forward pass, our method breaks the conventional tradeoff, simultaneously improving both safety and utility across diverse benchmarks.
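The core operation named in the abstract, projecting a feature onto the null space of a single bias direction, can be illustrated with a minimal sketch. This is generic linear algebra, not the authors' implementation: the function name `project_out`, the toy vectors, and the assumption that the bias direction is already available as a plain vector are all hypothetical. How the direction is estimated and at which layer the projection is applied are not specified in this summary.

```python
import math


def project_out(feature, bias):
    """Remove the component of `feature` that lies along `bias`.

    Computes h' = h - (h . b_hat) * b_hat with b_hat = b / ||b||, i.e. the
    projection of h onto the null space of the (assumed given) bias direction.
    """
    norm = math.sqrt(sum(x * x for x in bias))
    if norm == 0.0:
        raise ValueError("bias direction must be non-zero")
    unit = [x / norm for x in bias]                      # normalised direction b_hat
    coeff = sum(f * u for f, u in zip(feature, unit))    # scalar h . b_hat
    return [f - coeff * u for f, u in zip(feature, unit)]


# Toy example (hypothetical values): the bias points along the second axis,
# so only the second coordinate of the feature is removed.
h = [3.0, 4.0, 0.0]
b = [0.0, 2.0, 0.0]
print(project_out(h, b))  # [3.0, 0.0, 0.0]
```

The projection is idempotent, so applying it twice gives the same result as applying it once, and the output has zero inner product with the bias direction. Its cost is one dot product and one scaled subtraction per feature vector.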

Paper Structure (62 sections, 2 theorems, 11 equations, 11 figures, 9 tables)

This paper contains 62 sections, 2 theorems, 11 equations, 11 figures, 9 tables.

Key Result

lemma 1

By construction, each bias vector lies entirely in $W$ and is therefore orthogonal to its complement:

Figures (11)

  • Figure 1: Trade-off Between Safety and Utility Observed in LVLM Defense Methods. By eliminating the shared direction that underlies degradation in both safety and utility, our approach generates safe and useful responses, thereby overcoming the conventional tradeoff.
  • Figure 2: Consistent Modality-induced Bias Across Safety and Utility in LVLMs. (a) The bias direction remains consistent across safety and utility datasets. (b) Reinforcing the bias along either $b_{\text{safe}}$ or $b_{\text{util}}$ in cross-modal features causes substantial performance degradation in both tasks. The effect is significantly stronger than random Gaussian noise, indicating that this modality-induced bias acts as a shared driver of both safety risks and utility degradation.
  • Figure 3: Category-wise Performance Comparison Across Safety and Utility Tasks. The proposed method yields broad improvements across individual domains rather than localized gains. Together, these improvements drive the overall enhancement in both defensive robustness and multimodal reasoning capabilities.
  • Figure 4: Performance Gain Across LVLM Backbones over Vanilla. Ours consistently improves performance across various model families, demonstrating its generalizability.
  • Figure 5: Efficiency Comparison of Response Refinement Strategies. Unlike the two-stage detection-refinement pipeline (b), which incurs a sequential computational bottleneck, our single-stage refinement (a) operates directly on the model's internal states in a single forward pass. As shown in (c), TBOP runs about 60$\times$ faster than the second-best baseline (ETA), demonstrating strong scalability.
  • ...and 6 more figures

Theorems & Definitions (3)

  • definition 1: The Nuisance Subspace
  • lemma 1: Bias Vectors Lie in the Nuisance Space
  • lemma 2: Ideal Representation Orthogonality