VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language
Zonghao Ying, Moyang Chen, Nizhang Li, Zhiqiang Wang, Wenxin Zhang, Quanchen Zou, Zonglei Jing, Aishan Liu, Xianglong Liu
TL;DR
This work reveals a novel jailbreaking vulnerability in text-to-video models: latent cross-modal priors enable benign-looking prompts to yield unsafe videos. It introduces VEIL, a framework that combines an adversarial grammar with an LLM-guided zeroth-order search to compose safe-seeming anchors, auditory triggers, and stylistic modulators that collaboratively steer generation toward policy violations. By formalizing the attack as a constrained optimization with dedicated guidance oracles, VEIL achieves state-of-the-art attack success across seven T2V models and demonstrates resilience against LLM-based defenses. The findings highlight a fundamental safety gap tied to implicit world knowledge in multimodal models and underscore the need for defenses that address cross-modal associations beyond surface-level prompt safety. VEIL thus provides a rigorous platform for evaluating and diagnosing safety guardrails in modern T2V systems.
Abstract
Jailbreak attacks can circumvent model safety guardrails and reveal critical blind spots. Prior attacks on text-to-video (T2V) models typically add adversarial perturbations to obviously unsafe prompts, which are often easy to detect and defend against. In contrast, we show that benign-looking prompts containing rich, implicit cues can induce T2V models to generate semantically unsafe videos that both violate policy and preserve the original (blocked) intent. To realize this, we propose VEIL, a jailbreak framework that leverages T2V models' cross-modal associative patterns via a modular prompt design. Specifically, our prompts combine three components: neutral scene anchors, which provide the surface-level scene description extracted from the blocked intent to maintain plausibility; latent auditory triggers, textual descriptions of innocuous-sounding audio events (e.g., creaking, muffled noises) that exploit learned audio-visual co-occurrence priors to bias the model toward particular unsafe visual concepts; and stylistic modulators, cinematic directives (e.g., camera framing, atmosphere) that amplify and stabilize the latent trigger's effect. We formalize attack generation as a constrained optimization over this modular prompt space and solve it with a guided search procedure that balances stealth and effectiveness. Extensive experiments on 7 T2V models demonstrate the efficacy of our attack, which achieves a 23 percent improvement in average attack success rate on commercial models. Our demos and code are available at https://github.com/NY1024/VEIL.
