Video Imagination from a Single Image with Transformation Generation
Baoyang Chen, Wenmin Wang, Jinzhuo Wang, Xiongtao Chen
TL;DR
This work generates multiple plausible imaginary videos from a single image by modeling motion in transformation space rather than pixel space. It introduces three components: a transformation generator that produces sequences of affine or convolutional transformations conditioned on the input image and a latent variable, a volumetric merge network that reconstructs frames from the transformed intermediates, and a video critic trained with a Wasserstein GAN objective. The approach is evaluated on Moving MNIST, synthetic 2D shapes, and UCF101, using a new metric, RIQA, to assess reconstruction quality. It produces diverse, temporally coherent five-frame videos. The results suggest that modeling in transformation space makes unsupervised learning of multi-modal video imagination tractable, with promising image quality and acceptable perceptual quality across synthetic and natural data.
Abstract
In this work, we focus on a challenging task: synthesizing multiple imaginary videos given a single image. The major difficulties come from the high dimensionality of pixel space and the ambiguity of potential motions. To overcome these problems, we propose a new framework that produces imaginary videos by transformation generation. The generated transformations are applied to the original image in a novel volumetric merge network to reconstruct the frames of the imaginary video. By sampling different latent variables, our method can output different imaginary video samples. The framework is trained adversarially, without supervision. For evaluation, we propose a new assessment metric, $RIQA$. In experiments, we test on three datasets ranging from synthetic data to natural scenes. Our framework achieves promising performance in image quality assessment. Visual inspection indicates that it can successfully generate diverse five-frame videos of acceptable perceptual quality.
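The idea of producing frames by transforming the input image, rather than synthesizing pixels directly, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `affine_warp`, `imagine_video`, and `sample_thetas` are hypothetical, the learned transformation generator is replaced by random translations drawn from a seeded generator, and the volumetric merge network is replaced by a fixed weighted average of the transformed intermediates.

```python
import numpy as np


def affine_warp(image, theta):
    """Warp a 2D image with a 2x3 affine matrix (inverse mapping, nearest neighbor)."""
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Homogeneous output coordinates (x, y, 1), shape (3, h*w).
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = theta @ coords  # source (x, y) for every output pixel
    sx = np.rint(src[0]).astype(int)
    sy = np.rint(src[1]).astype(int)
    # Pixels whose source falls outside the image stay zero.
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros(h * w, dtype=image.dtype)
    out[valid] = image[sy[valid], sx[valid]]
    return out.reshape(h, w)


def imagine_video(image, thetas, weights=None):
    """Build frames from transformed copies of one image.

    thetas has shape (T, K, 2, 3): K affine transformations per frame.
    Assumption: a fixed weighted average stands in for the learned merge network.
    """
    num_frames, num_transforms = thetas.shape[:2]
    if weights is None:
        weights = np.full((num_frames, num_transforms), 1.0 / num_transforms)
    frames = []
    for t in range(num_frames):
        intermediates = np.stack(
            [affine_warp(image, thetas[t, k]) for k in range(num_transforms)]
        )
        frames.append(np.tensordot(weights[t], intermediates, axes=1))
    return np.stack(frames)


def sample_thetas(rng, num_frames=5, num_transforms=2, max_shift=2.0):
    """Hypothetical stand-in for the learned generator: random translations that grow over time."""
    steps = rng.uniform(-max_shift, max_shift, size=(num_transforms, 2))
    thetas = np.zeros((num_frames, num_transforms, 2, 3))
    thetas[..., 0, 0] = 1.0  # identity linear part
    thetas[..., 1, 1] = 1.0
    for t in range(num_frames):
        thetas[t, :, :, 2] = (t + 1) * steps  # accumulated shift at frame t
    return thetas


image = np.zeros((16, 16))
image[6:10, 6:10] = 1.0  # a small square as the single input image
video_a = imagine_video(image, sample_thetas(np.random.default_rng(0)))
video_b = imagine_video(image, sample_thetas(np.random.default_rng(1)))
```

Here the random seed plays the role of the latent variable: each seed yields a different transformation sequence and hence a different five-frame video from the same image. Because every frame is assembled from warped copies of the input, the output is confined to a much smaller space than raw pixels, which is the motivation stated above.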
