Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis
Hengyuan Cao, Yutong Feng, Biao Gong, Yijing Tian, Yunhong Lu, Chuang Liu, Bin Wang
TL;DR
This work introduces DRA-Ctrl, a framework for repurposing large-scale video generative models to perform controllable image generation across spatially aligned, subject-driven, and style-transfer tasks. It bridges the gap between continuous video dynamics and discrete image targets with a mixup-based shot transition, Frame Skip Position Embedding (FSPE), and a targeted attention masking strategy, all built on a HunyuanVideo-I2V backbone with a 3D-VAE/MLLM/transformer architecture. Empirically, DRA-Ctrl consistently outperforms image-trained baselines in controllability and subject adherence while achieving notable efficiency gains through FSPE-enabled long-range transitions. The results highlight the untapped potential of video priors to enhance cross-modal generation and suggest a pathway toward unified generative models across visual modalities.
Abstract
Video generative models can be regarded as world simulators due to their ability to capture the dynamic, continuous changes inherent in real-world environments. These models integrate high-dimensional information across visual, temporal, spatial, and causal dimensions, enabling predictions of subjects in various states. A natural and valuable research direction is to explore whether a fully trained video generative model operating in this high-dimensional space can effectively support lower-dimensional tasks such as controllable image generation. In this work, we propose a paradigm for video-to-image knowledge compression and task adaptation, termed \textit{Dimension-Reduction Attack} (\texttt{DRA-Ctrl}), which exploits the strengths of video models, including long-range context modeling and flattened full attention, to perform various generation tasks. Specifically, to bridge the challenging gap between continuous video frames and discrete image generation, we introduce a mixup-based transition strategy that ensures smooth adaptation. Moreover, we redesign the attention structure with a tailored masking mechanism to better align text prompts with image-level control. Experiments across diverse image generation tasks, such as subject-driven and spatially conditioned generation, show that repurposed video models outperform those trained directly on images. These results highlight the untapped potential of large-scale video generators for broader visual applications. \texttt{DRA-Ctrl} provides new insights into reusing resource-intensive video models and lays a foundation for future unified generative models across visual modalities. The project page is https://dra-ctrl-2025.github.io/DRA-Ctrl/.
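To make the idea of a mixup-based transition concrete, the sketch below builds a short clip that fades from a condition image to a target image by pixel-wise linear interpolation. It is an illustration under assumptions only: the function name `mixup_transition`, the linear weight schedule, and the array shapes are hypothetical and are not taken from the paper, which may blend in latent space or use a different schedule.

```python
import numpy as np


def mixup_transition(cond, target, num_frames):
    """Return a clip of shape (num_frames, H, W, C) fading from cond to target.

    Hypothetical helper: a generic mixup interpolation, not the authors' code.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    if cond.shape != target.shape:
        raise ValueError("cond and target must have the same shape")
    # Mixing weights rise linearly from 0 (pure condition) to 1 (pure target).
    lam = np.linspace(0.0, 1.0, num_frames).reshape(-1, 1, 1, 1)
    # Broadcasting blends every pixel of every intermediate frame.
    return (1.0 - lam) * cond[None] + lam * target[None]


# Toy usage: an all-black condition image and an all-white target image.
cond = np.zeros((4, 4, 3), dtype=np.float32)
target = np.ones((4, 4, 3), dtype=np.float32)
clip = mixup_transition(cond, target, num_frames=5)
```

In this toy setup the first frame equals the condition image, the last frame equals the target, and the frames in between are convex combinations of the two, which gives a video model a continuous sequence to follow rather than an abrupt cut.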
