ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection

Ruize Ma, Minghong Cai, Yilei Jiang, Jiaming Han, Yi Feng, Yingshui Tan, Xiaoyong Zhu, Bo Zhang, Bo Zheng, Xiangyu Yue

TL;DR

ConceptGuard addresses proactive safety in Text-and-Image-to-Video (TI2V) generation by detecting latent multimodal risks and suppressing unsafe concepts before or during generation. It introduces ConceptRisk, a large-scale concept-level multimodal safety dataset, and T2VSafetyBench-TI2V to assess cross-modal generalization, along with a two-stage framework: (1) a contrastive risk detector that fuses image and text into a structured risk space, and (2) a semantic suppression mechanism that edits prompt embeddings and the visual foundation to steer generation away from harm. The approach achieves state-of-the-art results in multimodal risk detection (e.g., 0.976 overall accuracy) and safe video generation on challenging TI2V tasks, while maintaining fidelity to user intent. Together, these contributions provide a scalable, interpretable safety toolkit and rigorous evaluation for advancing safer TI2V technologies.

Abstract

Recent progress in video generative models has enabled the creation of high-quality videos from multimodal prompts that combine text and images. While these systems offer enhanced controllability, they also introduce new safety risks, as harmful content can emerge from individual modalities or their interaction. Existing safety methods are often text-only, require prior knowledge of the risk category, or operate as post-generation auditors, and therefore struggle to proactively mitigate such compositional, multimodal risks. To address this challenge, we present ConceptGuard, a unified safeguard framework for proactively detecting and mitigating unsafe semantics in multimodal video generation. ConceptGuard operates in two stages: first, a contrastive detection module identifies latent safety risks by projecting fused image-text inputs into a structured concept space; second, a semantic suppression mechanism steers the generative process away from unsafe concepts by intervening in the prompt's multimodal conditioning. To support the development and rigorous evaluation of this framework, we introduce two novel benchmarks: ConceptRisk, a large-scale dataset for training on multimodal risks, and T2VSafetyBench-TI2V, the first benchmark adapted from T2VSafetyBench for the Text-and-Image-to-Video (TI2V) safety setting. Comprehensive experiments on both benchmarks show that ConceptGuard consistently outperforms existing baselines, achieving state-of-the-art results in both risk detection and safe video generation. Our code is available at https://github.com/Ruize-Ma/ConceptGuard.
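
To make the detection stage more concrete, the following is a minimal sketch of scoring a fused image-text embedding against a bank of unsafe concept embeddings and keeping the top-k matches. It is an illustration under stated assumptions, not the authors' implementation: the function names `score_concepts` and `top_k_risks`, the use of cosine similarity, and the `threshold` parameter are all hypothetical, and the fused embedding is taken as given rather than produced by the paper's detection module.

```python
import numpy as np

def score_concepts(fused, concept_embs):
    """Cosine similarity between one fused embedding and each concept embedding.

    fused:        shape (d,)   -- assumed output of some image-text fusion step
    concept_embs: shape (n, d) -- one row per unsafe concept (hypothetical bank)
    """
    fused = fused / np.linalg.norm(fused)
    concepts = concept_embs / np.linalg.norm(concept_embs, axis=1, keepdims=True)
    return concepts @ fused

def top_k_risks(scores, k, threshold):
    """Indices of the k highest-scoring concepts that also clear the threshold.

    The threshold is an assumption of this sketch: it lets a safe input
    return an empty list instead of always flagging k concepts.
    """
    order = np.argsort(-scores)[:k]
    return [int(i) for i in order if scores[i] >= threshold]

# Toy example: three orthogonal "concepts" in a 3-dimensional space.
concept_bank = np.eye(3)
fused_input = np.array([0.9, 0.1, 0.0])
scores = score_concepts(fused_input, concept_bank)
flagged = top_k_risks(scores, k=2, threshold=0.5)
```

In this toy setup only the first concept clears the threshold, so `flagged` contains a single index even though `k` is 2.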

Paper Structure

This paper contains 64 sections, 8 equations, 9 figures, and 7 tables.

Figures (9)

  • Figure 1: ConceptGuard effectively safeguards against multimodal risks that evade existing methods. (a) Given an unsafe image and unsafe text, a standard generative model produces Not-Safe-for-Work (NSFW) content, whereas ConceptGuard generates a safe video. (b) In a more challenging scenario with an unsafe image and a safe text prompt, a text-only safety guard is ineffective as it cannot perceive the visual risk. In contrast, ConceptGuard identifies the unsafe visual input and steers the generation process toward a safe outcome. This highlights ConceptGuard's superior capability in handling both compositional and single-modality visual risks.
  • Figure 2: Overview of the ConceptGuard framework. It consists of two stages: (1) Multimodal Risk Detection, where image-text pairs are processed by a CLIP encoder and a detection module with cross-attention and gating to produce a fused representation, which is scored against unsafe concept embeddings; and (2) Semantic Risk Suppression, where the top-$k$ detected risks define a semantic subspace used to suppress unsafe token embeddings during video generation.
  • Figure 3: Ablation study results on the ConceptRisk test set. We report accuracy for each scenario to demonstrate the impact of removing key components. The results confirm our full model outperforms all variants, validating the effectiveness of our design.
  • Figure 4: Qualitative examples of ConceptGuard. For unsafe inputs covering violence (bombing) and illegal activities (bribery), our full framework successfully suppresses the harmful semantics and generates safe videos, while the uncontrolled model produces unsafe content.
  • Figure 5: The prompt template used for constructing the ConceptRisk dataset using LLMs.
  • ...and 4 more figures
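
The second stage in the Figure 2 caption, where the top-$k$ detected risks define a semantic subspace used to suppress unsafe token embeddings, can be illustrated with a simple orthogonal projection. This is a hedged sketch only: the caption does not specify the suppression rule, so removing each token's component inside the span of the detected concept embeddings is an assumption, as are the names `unsafe_basis`, `suppress`, and the `strength` parameter.

```python
import numpy as np

def unsafe_basis(concept_embs, tol=1e-10):
    """Orthonormal basis for the span of the detected concept embeddings.

    concept_embs: shape (k, d), one row per detected unsafe concept.
    Returns an array of shape (d, r) with r <= k; directions with
    near-zero singular values are dropped so duplicated concepts do
    not add spurious dimensions.
    """
    u, s, _ = np.linalg.svd(concept_embs.T, full_matrices=False)
    return u[:, s > tol]

def suppress(token_embs, basis, strength=1.0):
    """Subtract each token's projection onto the unsafe subspace.

    token_embs: shape (t, d), prompt token embeddings.
    strength:   1.0 removes the component entirely; smaller values
                attenuate it (a hypothetical knob for this sketch).
    """
    projection = token_embs @ basis @ basis.T
    return token_embs - strength * projection

# Toy example: one unsafe direction along the first axis.
detected = np.array([[1.0, 0.0, 0.0]])
tokens = np.array([[2.0, 3.0, 4.0],
                   [0.0, 1.0, 0.0]])
basis = unsafe_basis(detected)
cleaned = suppress(tokens, basis)
```

After suppression, every row of `cleaned` is orthogonal to the unsafe direction, while components outside that subspace are left unchanged.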