Show Me: Unifying Instructional Image and Video Generation with Diffusion Models

Yujiang Pu, Zhanbo Huang, Vishnu Boddeti, Yu Kong

TL;DR

ShowMe is a unified diffusion-based framework that treats instructional image and video generation as two manifestations of action-object state transformation. It uses a two-stage tuning strategy: first, the temporal modules are disabled and spatial LoRA adapters are tuned for state manipulation; then the temporal modules are re-enabled and a separate spatiotemporal LoRA is tuned for state prediction. Structure and motion rewards are applied alongside these stages. A one-step latent denoising approximation keeps structure-guided supervision efficient, and a KL-divergence-based motion reward encourages temporal coherence. Experiments and ablations on SSv2 and Epic-Kitchens 100, together with supplementary Ego4D results, show consistent improvements over expert baselines in both instructional image and video generation. These results suggest that video diffusion models can serve as unified action-object state transformers and point to practical potential for context-aware visual instruction systems.
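The one-step latent denoising approximation mentioned above can be illustrated with the standard noise-prediction identity used in DDPM-style diffusion models. This is a minimal sketch, not the paper's implementation: it assumes an epsilon-prediction parameterization, and the names `predict_clean_latent`, `x_t`, `eps_hat`, and `alpha_bar_t` are hypothetical. The latent is represented as a flat list of floats for simplicity.

```python
import math


def predict_clean_latent(x_t, eps_hat, alpha_bar_t):
    """One-step estimate of the clean latent from a noisy latent.

    Inverts the standard forward process
        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    by substituting the model's noise prediction eps_hat for eps.
    Assumption: epsilon-prediction; other parameterizations differ.
    """
    if not 0.0 < alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must lie in (0, 1]")
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [(x - noise_scale * e) / signal_scale for x, e in zip(x_t, eps_hat)]


# Round trip: noise a toy latent, then recover it with a perfect noise estimate.
x0 = [0.5, -1.0, 2.0]
eps = [0.1, 0.3, -0.2]
alpha_bar = 0.64
x_t = [
    math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * b
    for a, b in zip(x0, eps)
]
x0_hat = predict_clean_latent(x_t, eps, alpha_bar)
```

Because the estimate needs only a single forward pass, a decoded version of it could be compared against depth or edge targets without running the full sampling chain, which is presumably what makes the structure-guided supervision efficient.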

Abstract

Generating visual instructions in a given context is essential for developing interactive world simulators. While prior works address this problem through either text-guided image manipulation or video prediction, these tasks are typically treated in isolation. This separation reveals a fundamental issue: image manipulation methods overlook how actions unfold over time, while video prediction models often ignore the intended outcomes. To this end, we propose ShowMe, a unified framework that enables both tasks by selectively activating the spatial and temporal components of video diffusion models. In addition, we introduce structure and motion consistency rewards to improve structural fidelity and temporal coherence. Notably, this unification brings dual benefits: the spatial knowledge gained through video pretraining enhances contextual consistency and realism in non-rigid image edits, while the instruction-guided manipulation stage equips the model with stronger goal-oriented reasoning for video prediction. Experiments on diverse benchmarks demonstrate that our method outperforms expert models in both instructional image and video generation, highlighting the strength of video diffusion models as a unified action-object state transformer.
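As a rough illustration of how a KL-divergence-based motion consistency reward might be computed, the sketch below compares two motion profiles: optical-flow magnitudes and latent-difference magnitudes. The exact formulation is not given in this summary, so everything here is an assumption: the function names, the normalization into probability distributions, the direction of the divergence, and the use of flat lists of per-location magnitudes are all hypothetical.

```python
import math


def normalize(values, eps=1e-8):
    """Turn non-negative magnitudes into a discrete probability distribution."""
    shifted = [v + eps for v in values]  # eps avoids zero-probability bins
    total = sum(shifted)
    return [v / total for v in shifted]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def motion_reward(flow_magnitude, latent_magnitude):
    """Hypothetical reward: closer to zero when the two motion profiles agree."""
    p = normalize(flow_magnitude)
    q = normalize(latent_magnitude)
    return -kl_divergence(p, q)


# Profiles with the same shape score near zero; mismatched profiles score lower.
aligned = motion_reward([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
mismatched = motion_reward([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Normalizing both inputs makes the comparison depend on where motion is concentrated rather than on its absolute scale, which is one plausible reason to prefer a divergence over a direct magnitude difference.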

Paper Structure

This paper contains 21 sections, 6 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Video diffusion models inherently capture both spatial consistency and temporal dynamics, making them well-suited for unified action-object state transformations.
  • Figure 2: Illustration of the proposed ShowMe framework. For action-object state manipulation, we integrate LoRA into the Q-Former and spatial layers while disabling the temporal layers for model fine-tuning, followed by structure reward tuning to enhance depth and edge fidelity. For state prediction, we freeze all parameters and apply spatiotemporal LoRA for joint tuning, guided by motion reward to improve motion smoothness and consistency.
  • Figure 3: Motion consistency between flow and latent magnitude.
  • Figure 4: Comparison of different methods for instructional image generation. The first three rows are test samples from SSv2, and the last three rows are from Epic100. Our method is better at completing action instructions and maintaining contextual consistency.
  • Figure 5: Visualization for the effect of structure reward tuning.
  • ...and 10 more figures
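The two-stage schedule described in the Figure 2 caption can be summarized as a small configuration sketch. It only restates which components the caption says are active in each stage; the module-group names (`q_former`, `spatial_layers`, `temporal_layers`), the stage labels, and the `StageConfig` type are hypothetical and do not come from the authors' code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StageConfig:
    temporal_layers_enabled: bool   # whether temporal modules run in this stage
    lora_targets: Tuple[str, ...]   # module groups receiving LoRA updates (assumed names)
    reward: str                     # reward used for the follow-up tuning


def stage_config(stage: str) -> StageConfig:
    """Return the assumed setup for a given tuning stage."""
    if stage == "manipulation":
        # Stage 1: temporal layers disabled; LoRA on the Q-Former and spatial layers.
        return StageConfig(False, ("q_former", "spatial_layers"), "structure")
    if stage == "prediction":
        # Stage 2: existing parameters frozen; a separate spatiotemporal LoRA is tuned.
        return StageConfig(True, ("spatial_layers", "temporal_layers"), "motion")
    raise ValueError(f"unknown stage: {stage!r}")


manipulation = stage_config("manipulation")
prediction = stage_config("prediction")
```

Reading "spatiotemporal LoRA" as adapters on both spatial and temporal layers is an interpretation of the caption, not a confirmed detail.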