GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, Josef Sivic

TL;DR

GenHowTo tackles the challenge of generating temporally coherent and physically plausible images of actions and object state transformations from an initial image and textual prompts. It leverages a large-scale dataset mined from instructional videos (≈200k 5-tuples) and trains two diffusion-based models (one for actions, one for final states) conditioned on both the input image and text, using semantic conditioning to preserve background while modifying target objects. The method demonstrates superior quantitative performance and compelling qualitative results, outperforming baselines on unseen categories and approaching real-image quality when trained with held-out categories. This work enables more realistic intermediate-goal image generation with strong scene consistency, which has practical implications for robotics, planning, and image editing.

Abstract

We address the task of generating temporally consistent and physically plausible images of actions and object state transformations. Given an input image and a text prompt describing the targeted transformation, our generated images preserve the environment and transform objects in the initial image. Our contributions are threefold. First, we leverage a large body of instructional videos and automatically mine a dataset of triplets of consecutive frames corresponding to initial object states, actions, and resulting object transformations. Second, equipped with this data, we develop and train a conditioned diffusion model dubbed GenHowTo. Third, we evaluate GenHowTo on a variety of objects and actions and show superior performance compared to existing methods. In particular, we introduce a quantitative evaluation where GenHowTo achieves 88% and 74% on seen and unseen interaction categories, respectively, outperforming prior work by a large margin.
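As a rough illustration of the mined data described above, one sample can be pictured as three frames (initial state, action, final state) together with text prompts. The sketch below is not the authors' code: the class `MinedSample`, its field names, and the helper `to_training_pairs` are hypothetical. Storing two prompts per triplet is only one plausible reading of the "5-tuples" mentioned in the TL;DR. The split into two (input image, prompt, target image) pairs mirrors the TL;DR's description of one model for actions and one for final states.

```python
from dataclasses import dataclass
from typing import Tuple

# (input image path, text prompt, target image path)
TrainingPair = Tuple[str, str, str]


@dataclass(frozen=True)
class MinedSample:
    """Hypothetical container for one mined example; all names are assumptions."""
    initial_frame: str   # path to the frame showing the object's initial state
    action_frame: str    # path to the frame showing the action
    final_frame: str     # path to the frame showing the resulting state
    action_prompt: str   # caption describing the action
    final_prompt: str    # caption describing the final state


def to_training_pairs(sample: MinedSample) -> Tuple[TrainingPair, TrainingPair]:
    """Split one sample into an action pair and a final-state pair.

    Both pairs share the same input image (the initial state); only the
    prompt and the target image differ.
    """
    action_pair = (sample.initial_frame, sample.action_prompt, sample.action_frame)
    final_pair = (sample.initial_frame, sample.final_prompt, sample.final_frame)
    return action_pair, final_pair


# Example with made-up file names and captions.
sample = MinedSample(
    initial_frame="frames/0001_initial.jpg",
    action_frame="frames/0001_action.jpg",
    final_frame="frames/0001_final.jpg",
    action_prompt="a person slicing an avocado",
    final_prompt="a sliced avocado on a cutting board",
)
action_pair, final_pair = to_training_pairs(sample)
```

Keeping the initial frame as the shared input makes explicit that both generated images are meant to depict the same scene as the input.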

Paper Structure

This paper contains 16 sections, 1 equation, 19 figures, and 3 tables.

Figures (19)

  • Figure 1: Given an image of an initial scene (red) and text prompts (bold), GenHowTo generates images corresponding to the action (blue) and the final state when the action is completed (yellow). GenHowTo is learned from instructional videos and can generate new images of both seen and previously unseen object transformations. Importantly, GenHowTo learns to preserve the scene so that the action is shown in the same environment, in the spirit of HowTo examples, while introducing important objects (e.g., hand and knife in the first example) and transforming the object according to the prompt.
  • Figure 2: Method overview. We use a self-supervised model to detect objects before, during, and after they are manipulated in instructional videos (top left). Then, the detected frames are automatically annotated using an image captioning model (top right). Finally, the detected frames with the text annotations are used to train our two diffusion models for transforming objects in the images (bottom).
  • Figure 3: GenHowTo model overview. The model $\epsilon_\theta$ takes as input (left) a frame depicting the object in its initial state $\mathcal{I}$ and a text prompt $\mathcal{P}$ describing an action or the desired final state. The output of the model is an image $\mathcal{I}^*$ of the same scene but depicting the action or the desired final state.
  • Figure 4: Various model predictions for moments from instructional videos unseen during training. We generate the action (blue) and the final state of the object (yellow) given the initial state image (red) and the corresponding text prompt (bold) as the input. Our method correctly models hands interacting with objects (top left) and preserves scene elements such as the cutting board (bottom right). The method can also introduce tools, such as a knife, into the scene to fulfill the prompts, e.g., the slicing action (bottom right).
  • Figure 5: Long-term generation. Each image is generated recurrently from the image to its left and the prompt shown above it (top). The leftmost image is a real photo. Even though there are some compounding artifacts, especially when large changes to the scene are required (e.g., a blender $\to$ a glass), our model can generate plausible chains of transformations while preserving the scene.
  • ...and 14 more figures
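
The recurrent generation described in the Figure 5 caption can be sketched as a simple loop in which each output becomes the next input. This is a minimal sketch under stated assumptions, not the paper's implementation: `generate_chain` and the `generate` callable are hypothetical names, and `generate` stands in for any image-and-text conditioned generator. A toy string-based generator is used here so the control flow can be run without a model.

```python
from typing import Callable, List, Sequence, TypeVar

Image = TypeVar("Image")


def generate_chain(
    first_image: Image,
    prompts: Sequence[str],
    generate: Callable[[Image, str], Image],
) -> List[Image]:
    """Apply `generate` recurrently, feeding each output back in as input.

    `first_image` plays the role of the real photo; every later element is
    produced from the previous element and the next prompt.
    """
    images = [first_image]
    for prompt in prompts:
        images.append(generate(images[-1], prompt))
    return images


def toy_generate(image: str, prompt: str) -> str:
    """Stand-in generator (an assumption): records which prompt was applied."""
    return f"{image} -> {prompt}"


chain = generate_chain("real photo", ["peel banana", "slice banana"], toy_generate)
```

Because each step conditions only on the previous output, any artifact introduced early is carried forward, which is consistent with the compounding artifacts the caption mentions.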