Pragma-VL: Towards a Pragmatic Arbitration of Safety and Helpfulness in MLLMs

Ming Wen, Kun Yang, Xin Chen, Jingyu Zhang, Dingding Han, Shiwen Cui, Yuedong Xu

Abstract

Multimodal Large Language Models (MLLMs) pose critical safety challenges, as they are susceptible not only to adversarial attacks such as jailbreaking but also to inadvertently generating harmful content for benign users. While internal safety alignment via Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) is a primary mitigation strategy, current methods often face a safety-utility trade-off: they either refuse benign queries out of excessive caution or overlook latent risks in cross-modal interactions. To resolve this, we introduce Pragma-VL, an end-to-end alignment algorithm that enables MLLMs to pragmatically arbitrate between safety and helpfulness. First, we enhance visual risk perception with a novel cold-start SFT stage. This is achieved by applying risk-aware clustering to the visual encoder and using an interleaved dataset of risk descriptions and high-quality data. Second, we introduce a theoretically-guaranteed reward model that leverages synergistic learning. We train it with a novel data augmentation method that assigns dynamic weights based on the queries, enabling contextual arbitration between safety and helpfulness. Extensive experiments show that Pragma-VL effectively balances safety and helpfulness, outperforming baselines by 5% to 20% on most multimodal safety benchmarks while preserving its general capabilities in areas such as mathematics and knowledge reasoning.
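
The contextual arbitration described in the abstract can be illustrated with a toy example. This is a minimal sketch, not the paper's reward model: it assumes the query-dependent weight enters as a simple convex combination of a safety score and a helpfulness score. The names `RewardScores`, `arbitrate`, and `safety_weight`, and all numeric values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RewardScores:
    safety: float       # hypothetical per-response safety score in [0, 1]
    helpfulness: float  # hypothetical per-response helpfulness score in [0, 1]


def arbitrate(scores: RewardScores, safety_weight: float) -> float:
    """Blend the two scores with a query-dependent weight.

    Assumption: a convex combination; the paper's actual weighting
    scheme is not specified in the abstract.
    """
    if not 0.0 <= safety_weight <= 1.0:
        raise ValueError("safety_weight must lie in [0, 1]")
    return safety_weight * scores.safety + (1.0 - safety_weight) * scores.helpfulness


# Two candidate responses, scored once (illustrative values only).
candidates = {
    "refusal": RewardScores(safety=1.0, helpfulness=0.1),
    "detailed_answer": RewardScores(safety=0.4, helpfulness=0.9),
}

# The same candidates are ranked under two different query contexts.
for label, weight in [("benign query", 0.2), ("risky query", 0.9)]:
    best = max(candidates, key=lambda name: arbitrate(candidates[name], weight))
    print(label, "->", best)
```

With these illustrative numbers, the detailed answer scores higher when the query is weighted toward helpfulness, and the refusal scores higher when the query is weighted toward safety. A single static weight could not produce both outcomes.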

Paper Structure

This paper contains 24 sections, 3 theorems, 17 equations, 15 figures, 8 tables, 2 algorithms.

Key Result

Theorem 1

If the reward function $r(y; \theta)$ is differentiable, the expected errors for the three frameworks, as specified in Definition 1, follow the strict orderings for both MSE and Preference Error: where the subscripts correspond to the estimators $\hat{\theta}_{par}$, $\hat{\theta}_{seq}$, and $\hat{\theta}_{single}$.

Figures (15)

  • Figure 1: The dual failure modes of static safety policies in MLLMs. Our work aims to train a pragmatic model that dynamically arbitrates the safety-helpfulness trade-off based on the context.
  • Figure 2: (a) Overview of Pragma-VL, which trains the MLLM to perform context-aware dynamic arbitration, achieving a flexible balance between safety and helpfulness. (b) An illustration of our Contextual Data Augmentation Pipeline.
  • Figure 3: Pragma-VL Algorithm Pipeline. (a) MLLM Cold-Start (b) Prompt Regulated Reward
  • Figure 4: Ablation study of the Pragma-VL framework. Results consistently demonstrate that the full Pragma-VL framework outperforms its individual components, highlighting the synergistic effect of combining risk-aware pre-alignment with subsequent policy alignment.
  • Figure 5: (a) The distribution of items across all categories. (b) Score distributions for helpfulness, safety, and weighted metrics (top), with the corresponding word length distribution for each score bin (bottom).
  • ...and 10 more figures

Theorems & Definitions (5)

  • Definition 1: Error Metrics
  • Theorem 1: Error Ordering of Reward Model Architectures
  • proof
  • Lemma 1: Upper Bound of Pair-wise Preference Error [zhang2025bradleyterrymultiobjectiverewardmodeling]
  • Lemma 2: Approximation of MSE from Parameter Covariance [zhang2025bradleyterrymultiobjectiverewardmodeling]