
Anti-I2V: Safeguarding your photos from malicious image-to-video generation

Duc Vu, Anh Nguyen, Chi Tran, Anh Tran

Abstract

Advances in diffusion-based video generation models, while significantly improving human animation, pose threats of misuse through the creation of fake videos from a specific person's photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation; relatively few explicitly address image-to-video diffusion models (VDMs), and those that do focus primarily on UNet-based architectures. Hence, their effectiveness against Diffusion Transformer (DiT) models remains largely under-explored, as these models demonstrate improved feature retention and stronger temporal consistency thanks to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation that is applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the $L^*a^*b^*$ and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process and design training objectives that maximize the degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.
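The abstract's core idea of optimizing the protective noise in $L^*a^*b^*$ and frequency space, rather than directly in RGB, can be illustrated with a minimal sketch. This is not the paper's released implementation; the conversion, the low-frequency window, and all function names are illustrative assumptions.

```python
import numpy as np

def rgb_to_lab(rgb):
    """Minimal sRGB (values in [0, 1]) -> CIE L*a*b* conversion (D65 white)."""
    lin = np.where(rgb > 0.04045, ((rgb + 0.055) / 1.055) ** 2.4, rgb / 12.92)
    m = np.array([[0.4124564, 0.3575761, 0.1804375],
                  [0.2126729, 0.7151522, 0.0721750],
                  [0.0193339, 0.1191920, 0.9503041]])
    xyz = lin @ m.T / np.array([0.95047, 1.0, 1.08883])
    f = np.where(xyz > 0.008856, np.cbrt(xyz), 7.787 * xyz + 16 / 116)
    L = 116 * f[..., 1] - 16
    a = 500 * (f[..., 0] - f[..., 1])
    b = 200 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)

def perturb_low_freq(channel, delta, frac=0.25):
    """Add a perturbation `delta` to the low-frequency band of one channel
    via a 2-D FFT, then invert back to the spatial domain."""
    spec = np.fft.fftshift(np.fft.fft2(channel))
    h, w = channel.shape
    rh, rw = int(h * frac) // 2, int(w * frac) // 2
    ch, cw = h // 2, w // 2
    # Illustrative choice: restrict the update to a centered low-freq window.
    spec[ch - rh:ch + rh, cw - rw:cw + rw] += delta
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec)))

# Example: perturb the L* channel of a random image.
img = np.random.rand(64, 64, 3)
lab = rgb_to_lab(img)
delta = 0.1 * np.random.randn(16, 16)  # matches the 16x16 window for frac=0.25
lab[..., 0] = perturb_low_freq(lab[..., 0], delta)
```

In the actual method the perturbation would be optimized iteratively against the video diffusion model's objectives (the IRA/IRC losses of Figure 1), not sampled randomly as here.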

Paper Structure

This paper contains 34 sections, 15 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overall pipeline of Anti-I2V. We first generate a reference video from the user-provided image and captions produced by the LVLM chen2024sharegpt4video. Then we integrate the IRA and IRC losses with the vanilla training loss as the final objective. The noise is iteratively optimized through both $L^*a^*b^*$ and frequency space.
  • Figure 2: PCA visualization of features in each layer. Features from each block are visualized at timestep 500. The first row shows features from OpenSora opensora, while the second row shows features from CogVideoX cogX. For clarity, only selected layers are highlighted.
  • Figure 3: Qualitative comparison of Anti-I2V and baseline protections against different video generation models on UCF101. The columns present the generated outputs from the models under different adversarial attack methods.
  • Figure 4: Qualitative comparison of adversarial attack methods against CogVideoX cogX. The first column shows the reference frame. The remaining columns present the generated outputs from models.
  • Figure 5: Qualitative comparison of adversarial attack methods against CogVideoX cogX. The first column shows the reference frame. The remaining columns present the generated outputs from models.
  • ...and 2 more figures
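The per-layer PCA visualization described in Figure 2 can be sketched as follows: project a block's token activations onto their top-3 principal components and render them as RGB. Tensor shapes, the grid layout, and function names here are assumptions for illustration, not the authors' code.

```python
import numpy as np

def pca_feature_map(features, grid_hw):
    """Visualize one block's activations via PCA.

    features: (num_tokens, dim) array of token features from a
              DiT/UNet block at a given denoising timestep.
    grid_hw:  (H, W) spatial layout of the tokens.
    Returns an (H, W, 3) image of the top-3 principal components.
    """
    x = features - features.mean(axis=0, keepdims=True)
    # Top-3 right singular vectors are the principal directions.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:3].T                              # (num_tokens, 3)
    # Normalize each component to [0, 1] so it can be shown as RGB.
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    proj = (proj - lo) / np.maximum(hi - lo, 1e-8)
    h, w = grid_hw
    return proj.reshape(h, w, 3)

# Example: 256 tokens of dimension 768 arranged on a 16x16 grid.
vis = pca_feature_map(np.random.randn(256, 768), (16, 16))
```

Comparing such maps across blocks is one way to identify, as the abstract describes, which layers capture the most distinct semantic features during denoising.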