Video Imagination from a Single Image with Transformation Generation

Baoyang Chen, Wenmin Wang, Jinzhuo Wang, Xiongtao Chen

TL;DR

This work tackles generating multiple plausible imaginary videos from a single image by modeling motion in transformation space rather than pixel space. It introduces a transformation generator that produces sequences of affine or convolutional transformations conditioned on the input image and a latent variable, a volumetric merge network that reconstructs frames from transformed intermediates, and a video critic trained with a Wasserstein GAN objective. The approach is evaluated on Moving MNIST, synthetic 2D shapes, and UCF101, with a novel RIQA metric to assess reconstruction quality, and shows diverse, sharp five-frame videos that respect temporal coherence. The findings demonstrate that transformation-space modeling enables tractable, unsupervised learning of multi-modal video imagination with strong perceptual quality and diversity across data regimes.

Abstract

In this work, we focus on a challenging task: synthesizing multiple imaginary videos given a single image. The major problems come from the high dimensionality of pixel space and the ambiguity of potential motions. To overcome these problems, we propose a new framework that produces imaginary videos by transformation generation. The generated transformations are applied to the original image in a novel volumetric merge network to reconstruct the frames of the imaginary video. By sampling different latent variables, our method can output different imaginary video samples. The framework is trained adversarially with unsupervised learning. For evaluation, we propose a new assessment metric, $RIQA$. In experiments, we test on three datasets ranging from synthetic data to natural scenes. Our framework achieves promising performance in image quality assessment. Visual inspection indicates that it can successfully generate diverse five-frame videos of acceptable perceptual quality.

Paper Structure

This paper contains 23 sections, 6 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Synthesizing multiple imaginary videos from one single image. For instance, given an image of a dancing ballerina, the videos of the dancer jumping higher or landing softly are both plausible imaginary videos. Those videos can be synthesized through applying a sequence of transformations to the original image.
  • Figure 2: Pipeline of video imagination from a single image. In our framework, to produce one imaginary video, the input image is first encoded into a condition code and sent to the transformation generator together with a latent variable. The generated transformation sequence is then applied to the input image in the volumetric merge network, where frames are reconstructed from the transformed images and volumetric kernels. Those four frames form one imaginary video. By sampling different latent variables from a Gaussian distribution, our framework can produce diverse imaginary videos.
  • Figure 3: Different convolution kernels result in different motions. The dotted square denotes the convolution kernel, and the image on the right shows the result of applying the kernel. One simple kernel can model motions such as (b) translation, (c) zoom, and (d) warp.
  • Figure 4: Intermediate image sequence $I_T$ as a 3-D entity. A volumetric kernel can take both neighboring pixel values and differences between intermediate images into consideration.
  • Figure 5: Quality performance of our framework. In each dotted box, the first row shows the synthesized imaginary videos given the first frame as input. The second row shows the difference images between the synthesized frames and the input. (a) and (b) show results on the Moving MNIST and 2D shapes datasets. (c) and (d) show results for the surfing class of the UCF101 dataset at resolutions of (c) $64 \times 64$ and (d) $128 \times 128$. (e) and (f) show results for images from the swing and ice-dancing categories of the UCF101 dataset. The synthesized frames are sharp and clear. The difference images illustrate plausible motions. Results at different resolutions and on different image categories of the UCF101 dataset suggest that our framework scales to the complexity of high-resolution videos.
  • ...and 5 more figures
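
The caption of Figure 3 states that one simple convolution kernel can model a motion such as translation. The sketch below illustrates that idea in isolation, using NumPy only. It is not the paper's transformation generator or merge network: the helper `apply_kernel`, the 5×5 toy image, and the hand-written kernels are all assumptions made for illustration, whereas in the paper the kernels are produced by a learned generator.

```python
import numpy as np

def apply_kernel(image, kernel):
    """Cross-correlate a 2-D image with a small odd-sized kernel (zero padding).

    Hypothetical helper for illustration; output has the same shape as `image`.
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros_like(image, dtype=float)
    h, w = image.shape
    # Accumulate one shifted copy of the image per kernel entry.
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

# Toy input: a single bright pixel in the centre of a 5x5 image.
image = np.zeros((5, 5))
image[2, 2] = 1.0

# Identity kernel: all weight on the centre tap, so the image is unchanged.
identity = np.zeros((3, 3))
identity[1, 1] = 1.0

# Translation kernel: all weight on the left neighbour, so each output pixel
# copies the pixel to its left and the content moves one pixel to the right.
translate_right = np.zeros((3, 3))
translate_right[1, 0] = 1.0

shifted = apply_kernel(image, translate_right)
print(shifted)
```

With the weight on a different tap the content moves in a different direction, and spreading the weight over several taps blends several shifted copies, which is one way to read the zoom and warp panels of Figure 3.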