Enhancing Bandwidth Efficiency for Video Motion Transfer Applications using Deep Learning Based Keypoint Prediction

Xue Bai; Tasmiah Haque; Sumit Mohan; Yuliang Cai; Byungheon Jeong; Adam Halasz; Srinjoy Das

TL;DR

The proposed architecture enables up to 2x additional bandwidth reduction over existing keypoint-based video motion transfer frameworks without significantly compromising video quality.

Abstract

We propose a novel deep learning based prediction framework for enhanced bandwidth reduction in motion-transfer-enabled video applications such as video conferencing, virtual reality gaming, and privacy preservation for patient health monitoring. To model complex motion, we use the First Order Motion Model (FOMM), which represents dynamic objects using learned keypoints along with their local affine transformations. Keypoints are extracted by a self-supervised keypoint detector and organized into a time series corresponding to the video frames. To enable transmission at a lower frame rate from the source device, keypoints are predicted using a Variational Recurrent Neural Network (VRNN). The predicted keypoints are then synthesized into video frames using an optical flow estimator and a generator network. The efficacy of combining keypoint-based representations with VRNN-based prediction, for both video animation and reconstruction, is demonstrated on three diverse datasets. For real-time applications, our results show that the proposed architecture enables up to 2x additional bandwidth reduction over existing keypoint-based video motion transfer frameworks without significantly compromising video quality.
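
To make the bandwidth claim concrete, the following is a minimal arithmetic sketch, not the authors' implementation. It assumes the sender transmits keypoints only for an observed window of frames and the receiver predicts the keypoints for the frames that follow. The function names, the window lengths, and the payload layout (10 keypoints, each with 2 coordinates and a 2x2 affine matrix stored as 32-bit floats) are illustrative assumptions rather than values stated above.

```python
def keypoint_payload_bytes(num_keypoints=10, floats_per_keypoint=6, bytes_per_float=4):
    """Bytes needed to send one frame's keypoints.

    Assumed layout (hypothetical): each keypoint carries 2 coordinates
    plus a 2x2 local affine matrix, i.e. 6 floats in total.
    """
    return num_keypoints * floats_per_keypoint * bytes_per_float


def bandwidth_reduction(observed_frames, predicted_frames):
    """Ratio of keypoint traffic without prediction to traffic with prediction.

    Without prediction, keypoints for every frame are transmitted.
    With prediction, only the observed frames are transmitted and the
    receiver fills in the remaining frames itself.
    """
    if observed_frames <= 0 or predicted_frames < 0:
        raise ValueError("observed_frames must be > 0 and predicted_frames >= 0")
    total_frames = observed_frames + predicted_frames
    return total_frames / observed_frames


# Illustrative window sizes (assumed, not taken from the paper).
per_frame = keypoint_payload_bytes()
ratio = bandwidth_reduction(observed_frames=12, predicted_frames=12)
sent_with_prediction = per_frame * 12
sent_without_prediction = per_frame * 24
```

Under these assumptions, predicting as many frames as are observed halves the keypoint traffic, which corresponds to the "up to 2x" figure. Predicting fewer frames per observed window gives a proportionally smaller saving.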

Paper Structure (14 sections, 11 equations, 8 figures, 6 tables)

Introduction
Related Works
The Proposed Pipeline
Recurrent Neural Network (RNN)
Variational Autoencoder (VAE)
Variational Recurrent Neural Network (VRNN)
Training and Inference
Experimental Results
Datasets
Evaluation Procedures
Results on Mgif dataset
Results on Bair dataset
Results on VoxCeleb dataset
Conclusions and Future Directions

Figures (8)

Figure 1: Components of our proposed pipeline for keypoint prediction and video synthesis
Figure 2: Graphical representation of VRNN describing the dependencies between the variables in Eqs. (7)-(10). The green arrows correspond to the computations involving the (conditional) prior and posterior on $\bm z_t$. The blue arrows show the computations involving the generative network. The computations for $\bm h_t$ are shown with red arrows.
Figure 3: Qualitative results for the Mgif dataset in reconstruction mode (upper panel) and transfer mode (lower panel). In each panel, consecutive frames generated using only FOMM are shown in the second row, and FOMM with keypoint prediction using RNN, VAE, and VRNN are shown in the third, fourth, and fifth rows respectively. For reconstruction mode, the first row serves as the ground truth, whereas for transfer mode the second row serves as the ground truth.
Figure 4: Qualitative results for the Bair dataset in reconstruction mode (upper panel) and transfer mode (lower panel). In each panel, consecutive frames generated using only FOMM are shown in the second row, and FOMM with keypoint prediction using RNN, VAE, and VRNN are shown in the third, fourth, and fifth rows respectively. For reconstruction mode, the first row serves as the ground truth, whereas for transfer mode the second row serves as the ground truth. The circles in both figures are examples of regions where VRNN performs better than RNN and VAE.
Figure 5: Qualitative results for the VoxCeleb dataset in reconstruction mode (upper panel) and transfer mode (lower panel). In each panel, consecutive frames generated using only FOMM are shown in the second row, and FOMM with keypoint prediction using RNN, VAE, and VRNN are shown in the third, fourth, and fifth rows respectively. For reconstruction mode, the first row serves as the ground truth, whereas for transfer mode the second row serves as the ground truth.
...and 3 more figures
