Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis

Hengyuan Cao, Yutong Feng, Biao Gong, Yijing Tian, Yunhong Lu, Chuang Liu, Bin Wang

TL;DR

This work introduces DRA-Ctrl, a framework for repurposing large-scale video generative models to perform controllable image generation across spatially-aligned, subject-driven, and style-transfer tasks. It tackles the gap between continuous video dynamics and discrete image targets with a mixup-based shot transition, Frame Skip Position Embedding, and a targeted attention masking strategy, all built on a HunyuanVideo-I2V backbone and a 3D-VAE/MLLM/transformer architecture. Empirically, DRA-Ctrl consistently outperforms image-trained baselines in controllability and subject adherence while achieving notable efficiency gains through FSPE-enabled long-range transitions. The results highlight the untapped potential of video priors to enhance cross-modal generation and suggest a pathway toward unified generative models across visual modalities.

Abstract

Video generative models can be regarded as world simulators due to their ability to capture the dynamic, continuous changes inherent in real-world environments. These models integrate high-dimensional information across visual, temporal, spatial, and causal dimensions, enabling predictions of subjects in various states. A natural and valuable research direction is to explore whether a fully trained video generative model in high-dimensional space can effectively support lower-dimensional tasks such as controllable image generation. In this work, we propose a paradigm for video-to-image knowledge compression and task adaptation, termed Dimension-Reduction Attack (DRA-Ctrl), which uses the strengths of video models, including long-range context modeling and flattened full attention, to perform various generation tasks. Specifically, to address the challenging gap between continuous video frames and discrete image generation, we introduce a mixup-based transition strategy that ensures smooth adaptation. Moreover, we redesign the attention structure with a tailored masking mechanism to better align text prompts with image-level control. Experiments across diverse image generation tasks, such as subject-driven and spatially conditioned generation, show that repurposed video models outperform those trained directly on images. These results highlight the untapped potential of large-scale video generators for broader visual applications. DRA-Ctrl provides new insights into reusing resource-intensive video models and lays a foundation for future unified generative models across visual modalities. The project page is https://dra-ctrl-2025.github.io/DRA-Ctrl/.

Paper Structure

This paper contains 33 sections, 4 equations, 21 figures, 7 tables.

Figures (21)

  • Figure 1: This paper leverages the high-level priors of video generative models to unify low-level controllable image generation. The bottom results show the various types of tasks supported by DRA-Ctrl.
  • Figure 2: The training framework of DRA-Ctrl. We propose a mixup-based transition strategy to construct shot-transition videos that adapt the video model to abrupt image changes, with FSPE strategically reducing the number of transitional frames. The loss function is adaptively reweighted according to the proportion of the target image in the token sequence. In addition, to align text prompts with image-level control, we design an attention masking mechanism.
  • Figure 2: Quantitative results on DreamBench. The best and second-best values of each metric are highlighted.
  • Figure 3: The inference process of T2V/I2V models and their fine-tuned subject-driven image generation models. The corresponding T2V/I2V baselines are obtained by treating the condition and target images directly as a two-frame video and fine-tuning the T2V/I2V models accordingly.
  • Figure 4: Qualitative results comparing different methods.
  • ...and 16 more figures
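
The mixup-based transition named in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mixup_transition`, the linear blending schedule, and the use of flat lists of floats in place of real image or latent tensors are all assumptions made for illustration. The sketch only shows the general idea of building a short "shot transition" clip whose first frame is the condition image and whose last frame is the target image.

```python
from typing import List

# Hypothetical stand-in for an image: a flat list of pixel (or latent) values.
Image = List[float]


def mixup_transition(condition: Image, target: Image, num_frames: int) -> List[Image]:
    """Build a clip that blends from `condition` to `target`.

    Assumption: a linear mixup schedule. The source only states that the
    transition is mixup-based; it does not specify the schedule.
    """
    if len(condition) != len(target):
        raise ValueError("condition and target must have the same size")
    if num_frames < 2:
        raise ValueError("need at least two frames")

    frames: List[Image] = []
    for t in range(num_frames):
        # Mixing weight: 0.0 at the first frame, 1.0 at the last frame.
        alpha = t / (num_frames - 1)
        frames.append(
            [(1.0 - alpha) * c + alpha * x for c, x in zip(condition, target)]
        )
    return frames


if __name__ == "__main__":
    clip = mixup_transition([0.0, 1.0], [1.0, 0.0], num_frames=3)
    for index, frame in enumerate(clip):
        print(index, frame)
```

With `num_frames=2` the clip reduces to the condition image followed directly by the target image, which corresponds to the two-frame setup described for the baselines in the Figure 3 caption.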