LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models

Guolei Huang, Qinzhi Peng, Gan Xu, Yuxuan Lu, Yongjun Shen

TL;DR

This work formalizes safety in multimodal multi-turn dialogues for Vision-Language Models and introduces MMDS, the first benchmark for MMT dialogue safety, constructed via a comprehensive MMRT-MCTS red-teaming pipeline. It further presents LLaVAShield, a dedicated multimodal multi-turn safety assessor that jointly detects risk in user inputs and assistant responses conditioned on policy dimensions, achieving state-of-the-art results and robust performance under dynamic policy configurations. The MMDS dataset and MMRT-MCTS framework provide a scalable foundation for rigorous evaluation and defense design, while the rationale-generation component enhances explainability and traceability of safety decisions. Overall, the approach significantly advances practical MMT moderation by highlighting cross-turn and cross-modal risks and offering a robust, adaptable solution for real-world deployments.

Abstract

As Vision-Language Models (VLMs) move into interactive, multi-turn use, new safety risks arise that single-turn or single-modality moderation misses. In Multimodal Multi-Turn (MMT) dialogues, malicious intent can be spread across turns and images, while context-sensitive replies may still advance harmful content. To address this challenge, we present the first systematic definition and study of MMT dialogue safety. Building on this formulation, we introduce the Multimodal Multi-turn Dialogue Safety (MMDS) dataset. We further develop an automated multimodal multi-turn red-teaming framework based on Monte Carlo Tree Search (MCTS) to generate unsafe multimodal multi-turn dialogues for MMDS. MMDS contains 4,484 annotated multimodal dialogue samples with fine-grained safety ratings, policy dimension labels, and evidence-based rationales for both users and assistants. Leveraging MMDS, we present LLaVAShield, a powerful tool that jointly detects and assesses risk in user inputs and assistant responses. Across comprehensive experiments, LLaVAShield consistently outperforms strong baselines on MMT content moderation tasks and under dynamic policy configurations, establishing new state-of-the-art results. We will publicly release the dataset and model to support future research.

Paper Structure

This paper contains 57 sections, 7 equations, 4 figures, 4 tables, 5 algorithms.

Figures (4)

  • Figure 1: Example of UMMD. The example illustrates how risk accumulates in context between user inputs and assistant outputs, and how it is jointly amplified across modalities and dialogue turns. Darker yellow indicates greater obfuscation and potential harm in the user’s malicious intent; darker red indicates a higher risk level in the AI assistant’s responses.
  • Figure 2: Overall workflow. (a) Generate adversarial intents and retrieve the associated images along with their textual descriptions; (b) MMDS dataset generation pipeline; (c) LLaVAShield conducts safety assessment for multimodal multi-turn dialogues conditioned on the specified policy dimensions.
  • Figure 3: Taxonomy of Safety Policy Dimensions.
  • Figure 4: MMDS test set distributional analysis.