VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language

Zonghao Ying, Moyang Chen, Nizhang Li, Zhiqiang Wang, Wenxin Zhang, Quanchen Zou, Zonglei Jing, Aishan Liu, Xianglong Liu

TL;DR

This work reveals a novel jailbreaking vulnerability in text-to-video models: latent cross-modal priors enable benign-looking prompts to yield unsafe videos. It introduces VEIL, a framework that combines an adversarial grammar with an LLM-guided zeroth-order search to compose safe-seeming anchors, auditory triggers, and stylistic modulators that collaboratively steer generation toward policy violations. By formalizing the attack as a constrained optimization with dedicated guidance oracles, VEIL achieves state-of-the-art attack success across seven T2V models and demonstrates resilience against LLM-based defenses. The findings highlight a fundamental safety gap tied to implicit world knowledge in multimodal models and underscore the need for defenses that address cross-modal associations beyond surface-level prompt safety. VEIL thus provides a rigorous platform for evaluating and diagnosing safety guardrails in modern T2V systems.

Abstract

Jailbreak attacks can circumvent model safety guardrails and reveal critical blind spots. Prior attacks on text-to-video (T2V) models typically add adversarial perturbations to obviously unsafe prompts, which are often easy to detect and defend against. In contrast, we show that benign-looking prompts containing rich, implicit cues can induce T2V models to generate semantically unsafe videos that both violate policy and preserve the original (blocked) intent. To realize this, we propose VEIL, a jailbreak framework that leverages T2V models' cross-modal associative patterns via a modular prompt design. Specifically, our prompts combine three components: neutral scene anchors, which provide the surface-level scene description extracted from the blocked intent to maintain plausibility; latent auditory triggers, textual descriptions of innocuous-sounding audio events (e.g., creaking, muffled noises) that exploit learned audio-visual co-occurrence priors to bias the model toward particular unsafe visual concepts; and stylistic modulators, cinematic directives (e.g., camera framing, atmosphere) that amplify and stabilize the latent trigger's effect. We formalize attack generation as a constrained optimization over this modular prompt space and solve it with a guided search procedure that balances stealth and effectiveness. Extensive experiments on 7 T2V models demonstrate the efficacy of our attack, which achieves a 23 percent improvement in average attack success rate on commercial models. Our demos and code are available at https://github.com/NY1024/VEIL.

Paper Structure

This paper contains 39 sections, 4 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: Illustration of our attack effects on T2V Models.
  • Figure 2: An overview of our proposed VEIL framework. VEIL operationalizes the jailbreak attack by using an LLM-guided search to find an optimal composition of individually benign components. This synergistic composition steers the T2V model to generate policy-violating content by exploiting its latent cross-modal associations.
  • Figure 3: Example jailbreak results on T2V models using baseline methods and VEIL.
  • Figure 4: Ablation results of our VEIL on Hailuo.
  • Figure 5: Ablation results on the hyperparameters.
  • ...and 5 more figures