Show Me: Unifying Instructional Image and Video Generation with Diffusion Models
Yujiang Pu, Zhanbo Huang, Vishnu Boddeti, Yu Kong
TL;DR
ShowMe presents a unified diffusion-based framework that treats instructional image and video generation as two manifestations of action-object transformation. It uses a two-stage tuning strategy: first, temporal modules are disabled and spatial LoRA adapters are tuned for state manipulation; then, temporal modules are re-enabled and a separate spatiotemporal LoRA is tuned for state prediction. Combined with structure and motion rewards, this strategy achieves superior results on SSv2 and Epic-Kitchens 100. The approach uses a one-step latent denoising approximation for efficient structure-guided supervision and a KL-divergence-based motion reward to encourage temporal coherence, demonstrating that video diffusion models can function as holistic action-object state transformers. Comprehensive experiments, ablations, and supplementary Ego4D results show consistent improvements over expert baselines in both instructional image and video generation, highlighting practical potential for context-aware visual instruction systems.
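To make the "one-step latent denoising approximation" concrete, here is a minimal sketch of the standard one-step clean-latent estimate used with epsilon-prediction diffusion models. It assumes the usual forward process z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps; the summary does not state which parameterization ShowMe uses, so the function name, the flat-list latent representation, and the noise-schedule value are illustrative assumptions rather than the authors' implementation.

```python
import math
from typing import List


def predict_clean_latent(
    noisy_latent: List[float],
    predicted_noise: List[float],
    alpha_bar_t: float,
) -> List[float]:
    """One-step estimate of the clean latent z_0 from a noisy latent z_t.

    Assumes an epsilon-prediction model and the forward process
    z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps,
    and inverts it with the predicted noise standing in for eps.
    """
    if not 0.0 < alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must lie in (0, 1]")
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [(z - sigma * e) / scale for z, e in zip(noisy_latent, predicted_noise)]


# Toy check: if the noise prediction is exact, the estimate recovers z_0.
z0 = [0.5, -1.0, 2.0]        # hypothetical clean latent
eps = [0.1, 0.2, -0.3]       # hypothetical noise sample
alpha_bar = 0.64             # hypothetical cumulative schedule value
z_t = [
    math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * b
    for a, b in zip(z0, eps)
]
z0_hat = predict_clean_latent(z_t, eps, alpha_bar)
```

Because the estimate needs only a single forward pass of the denoiser, a reward defined on the clean latent (or its decoded image) can be evaluated at any training timestep without running the full sampling chain, which is presumably what makes this kind of supervision efficient.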
Abstract
Generating visual instructions in a given context is essential for developing interactive world simulators. While prior works address this problem through either text-guided image manipulation or video prediction, these tasks are typically treated in isolation. This separation reveals a fundamental issue: image manipulation methods overlook how actions unfold over time, while video prediction models often ignore the intended outcomes. To this end, we propose ShowMe, a unified framework that enables both tasks by selectively activating the spatial and temporal components of video diffusion models. In addition, we introduce structure and motion consistency rewards to improve structural fidelity and temporal coherence. Notably, this unification brings dual benefits: the spatial knowledge gained through video pretraining enhances contextual consistency and realism in non-rigid image edits, while the instruction-guided manipulation stage equips the model with stronger goal-oriented reasoning for video prediction. Experiments on diverse benchmarks demonstrate that our method outperforms expert models in both instructional image and video generation, highlighting the strength of video diffusion models as a unified action-object state transformer.
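The idea of selectively activating spatial and temporal components can be illustrated with a toy sketch. The block below is not ShowMe's architecture and omits LoRA adapters entirely; `spatial_layer`, `temporal_layer`, and `video_block` are hypothetical stand-ins that only show why bypassing temporal layers lets a video model process each frame independently, like an image model.

```python
from typing import List

Frame = List[float]


def spatial_layer(frame: Frame) -> Frame:
    """Stand-in for a per-frame (spatial) layer: sees one frame at a time."""
    return [2.0 * x for x in frame]


def temporal_layer(frames: List[Frame]) -> List[Frame]:
    """Stand-in for a temporal layer: residual mixing with the mean over frames."""
    n = len(frames)
    mean = [sum(column) / n for column in zip(*frames)]
    return [[x + m for x, m in zip(frame, mean)] for frame in frames]


def video_block(frames: List[Frame], use_temporal: bool) -> List[Frame]:
    """Apply the spatial layer to every frame, then optionally mix across time."""
    out = [spatial_layer(frame) for frame in frames]
    if use_temporal:
        out = temporal_layer(out)
    return out


clip = [[1.0, 2.0], [3.0, 4.0]]  # two hypothetical frames of two values each

# Temporal path disabled: each frame is transformed on its own (image-style).
image_mode = video_block(clip, use_temporal=False)

# Temporal path enabled: every output frame now depends on the whole clip.
video_mode = video_block(clip, use_temporal=True)
```

In this toy setting, running the first frame alone with the temporal path disabled gives the same result as running it inside the clip, which is the property that allows an image-manipulation stage and a video-prediction stage to share one backbone.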
