Video Creation by Demonstration

Yihong Sun, Hao Zhou, Liangzhe Yuan, Jennifer J. Sun, Yandong Li, Xuhui Jia, Hartwig Adam, Bharath Hariharan, Long Zhao, Ting Liu

TL;DR

$\delta$-Diffusion is a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. It adopts implicit latent control to provide the flexibility and expressiveness that general videos require.

Abstract

We explore a novel video creation experience, namely Video Creation by Demonstration. Given a demonstration video and a context image from a different scene, we generate a physically plausible video that continues naturally from the context image and carries out the action concepts from the demonstration. To enable this capability, we present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. Unlike most existing video generation controls, which are based on explicit signals, we adopt implicit latent control for the flexibility and expressiveness that general videos require. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos to condition the generation process with minimal appearance leakage. Empirically, $\delta$-Diffusion outperforms related baselines in both human preference and large-scale machine evaluations, and shows potential for interactive world simulation. Sampled video generation results are available at https://delta-diffusion.github.io/.

Paper Structure

This paper contains 38 sections, 3 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Video Creation by Demonstration. Given a demonstration video, our proposed $\delta$-Diffusion generates a video that naturally continues from a context image and carries out the same action concepts.
  • Figure 2: (a) Overview of $\delta$-Diffusion. The context frame $I$ is provided to the generation model $\mathcal{G}$ along with the action latents $\delta_V$ extracted from the demonstration video $V$. (b) Extracting action latents. A spatial-temporal vision encoder extracts temporally aggregated spatiotemporal representations $\mathbf{z}$ from an input video $V$, with $t$ denoting the temporal dimension. In parallel, a spatial vision encoder extracts per-frame representations from $V$, which the feature predictor $\mathcal{P}$ aligns to $\mathbf{z}$, yielding $\mathbf{h}$. The appearance bottleneck then computes the action latents $\delta_V$ by subtracting the aligned spatial representations $\mathbf{h}$ from the spatiotemporal representations $\mathbf{z}$.
  • Figure 3: Qualitative results for the bottleneck ablation on the Something-Something v2 dataset [goyal2017something]. Using no bottleneck, or a temporal normalization bottleneck, results in appearance leakage, while generation based on our appearance bottleneck preserves the input context.
  • Figure 4: Qualitative comparisons of $\delta$-Diffusion against MotionDirector [zhao2023motiondirector] and WALT [walt] on the (a) Something-Something v2 [goyal2017something], (b) Epic Kitchens 100 [ek100], and (c) Fractal [fractal] datasets.
  • Figure 5: Auto-regressive generation controlled via a concatenation of three demonstration videos of varying lengths. The sequence of demonstrated action concepts ("picking something from a drawer and placing it on the table", "closing a drawer", and "opening a drawer") is coherently transferred to the input context.
  • ...and 5 more figures
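
The appearance bottleneck described in the Figure 2 caption, $\delta_V = \mathbf{z} - \mathbf{h}$, can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not the authors' implementation: the function name `appearance_bottleneck`, the list-of-vectors layout (one feature vector per time step), and the numeric values are all hypothetical, and the encoders and the predictor $\mathcal{P}$ are replaced by hand-written features.

```python
from typing import List

Vector = List[float]


def appearance_bottleneck(z: List[Vector], h: List[Vector]) -> List[Vector]:
    """Return delta = z - h, computed element-wise per time step.

    z: stand-in for the spatiotemporal representations (one vector per time step).
    h: stand-in for the aligned per-frame spatial representations.
    The one-vector-per-time-step layout is an assumption made for illustration.
    """
    if len(z) != len(h):
        raise ValueError("z and h must cover the same number of time steps")
    delta = []
    for z_t, h_t in zip(z, h):
        if len(z_t) != len(h_t):
            raise ValueError("feature dimensions of z and h must match")
        # Removing the appearance-aligned part leaves the residual used as the action latent.
        delta.append([a - b for a, b in zip(z_t, h_t)])
    return delta


# Hypothetical toy features: 2 time steps, 3 channels each.
z = [[1.0, 2.0, 3.0], [1.5, 2.0, 2.0]]
h = [[1.0, 2.0, 2.5], [1.0, 2.0, 2.5]]
delta_v = appearance_bottleneck(z, h)
print(delta_v)
```

In this toy, channels where `z` and `h` agree cancel to zero, which mirrors the stated goal of conditioning generation on action latents with minimal appearance leakage.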