
S-HR-VQVAE: Sequential Hierarchical Residual Learning Vector Quantized Variational Autoencoder for Video Prediction

Mohammad Adiban, Kalin Stefanov, Sabato Marco Siniscalchi, Giampiero Salvi

TL;DR

By leveraging the intrinsic capabilities of HR-VQVAE at modeling still images with a parsimonious representation, combined with the AST-PM's ability to handle spatiotemporal information, S-HR-VQVAE can better deal with major challenges in video prediction.

Abstract

We address the video prediction task by putting forth a novel model that combines (i) a novel hierarchical residual learning vector quantized variational autoencoder (HR-VQVAE), and (ii) a novel autoregressive spatiotemporal predictive model (AST-PM). We refer to this approach as a sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE). By leveraging the intrinsic capabilities of HR-VQVAE at modeling still images with a parsimonious representation, combined with the AST-PM's ability to handle spatiotemporal information, S-HR-VQVAE can better deal with major challenges in video prediction. These include learning spatiotemporal information, handling high-dimensional data, combating blurry predictions, and implicitly modeling physical characteristics. Extensive experimental results on four challenging tasks, namely KTH Human Action, TrafficBJ, Human3.6M, and Kitti, demonstrate that our model compares favorably against state-of-the-art video prediction techniques in both quantitative and qualitative evaluations despite a much smaller model size. Finally, we boost S-HR-VQVAE by proposing a novel training method to jointly estimate the HR-VQVAE and AST-PM parameters.
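To make the "residual learning vector quantized" part of the name concrete, the following is a minimal sketch of generic residual vector quantization: each layer quantizes whatever the previous layers left unexplained, so a vector is represented by one index per layer. It is an illustration only and not the authors' HR-VQVAE. The function names `nearest` and `residual_quantize`, the toy codebooks, and the use of squared Euclidean distance are assumptions introduced here, and the hierarchical linking between codebooks that the paper's title implies is not modeled.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def nearest(codebook: Sequence[Sequence[float]], v: Sequence[float]) -> int:
    """Return the index of the codebook entry closest to v (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], v)),
    )


def residual_quantize(
    v: Sequence[float], codebooks: Sequence[Sequence[Sequence[float]]]
) -> Tuple[List[int], Vector]:
    """Quantize v with one codebook per layer (hypothetical helper).

    Layer 1 quantizes v itself; every later layer quantizes the residual
    that remains after subtracting the codes chosen so far.
    """
    residual = list(v)
    reconstruction = [0.0] * len(v)
    indices: List[int] = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        code = codebook[k]
        indices.append(k)
        reconstruction = [r + c for r, c in zip(reconstruction, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, reconstruction


# Toy example with m = 2 layers and 2 codes per layer (assumed values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],     # coarse layer
    [[0.0, 0.0], [0.25, -0.25]],  # refinement layer
]
indices, reconstruction = residual_quantize([1.2, 0.8], codebooks)
print(indices, reconstruction)
```

In this toy setup the second layer only has to encode the small correction left by the first, which is one way to read the "parsimonious representation" mentioned above.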

Paper Structure (23 sections, 9 equations, 7 figures, 5 tables)


Figures (7)

  • Figure 1: Top: The HR-VQVAE module for hierarchical vector quantization, where each frame $i$ is encoded into $m$ hierarchical layers of quantized values $\left( Q_i^1, \dots, Q_i^m \right)$. Bottom: Illustration of the S-HR-VQVAE for video prediction, which combines HR-VQVAE with the AST-PM model. AST-PM predicts the indices of quantized values in both spatial and temporal dimensions, where each index at time $t+1$ is predicted by accessing only its preceding indices—those located above and to the left in a raster-scan spatial order and those before $t+1$ in the temporal domain.
  • Figure 2: Comparison of S-HR-VQVAE with state-of-the-art methods on the KTH Human Moving Action dataset over three sequences (a, b, and c) that are commonly reported in the literature. 10 frames (1-10 in the figures) are given as input, and the next 20 frames (11-30 in the figures) are predicted.
  • Figure 3: Comparison of S-HR-VQVAE with state-of-the-art methods on the TrafficBJ dataset. 4 frames (1-4 in the figure) are given as input, and the next 4 frames (5-8 in the figure) are predicted.
  • Figure 4: Comparison of S-HR-VQVAE with state-of-the-art methods on the Human3.6M dataset. 4 frames (1-4 in the figure) are given as input, and the next 4 frames (5-8 in the figure) are predicted.
  • Figure 5: Comparison of S-HR-VQVAE with the state-of-the-art method on the Kitti dataset, where 4 frames are given as input, and the next 5 frames are predicted.
  • ...and 2 more figures
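
The causal ordering described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the AST-PM implementation: the function name `causal_context` and the grid sizes are hypothetical, and it assumes that every position of an earlier frame is visible, which is one reading of "those before $t+1$ in the temporal domain". Within the frame being predicted, only positions that come earlier in raster-scan order (rows above, or the same row to the left) are visible.

```python
from typing import List, Tuple

Index = Tuple[int, int, int]  # (time step, row, column)


def causal_context(t: int, r: int, c: int, height: int, width: int) -> List[Index]:
    """List the index positions visible when predicting position (t, r, c).

    Assumption: all positions of frames before t are visible; within
    frame t, only positions preceding (r, c) in raster-scan order are.
    """
    context: List[Index] = []
    # Temporal context: every position of every earlier frame.
    for tt in range(t):
        for rr in range(height):
            for cc in range(width):
                context.append((tt, rr, cc))
    # Spatial context: rows above, plus the same row to the left.
    for rr in range(height):
        for cc in range(width):
            if (rr, cc) < (r, c):  # tuple comparison follows raster-scan order
                context.append((t, rr, cc))
    return context


# Toy 2x2 index grid: predicting the bottom-left index of the second frame.
visible = causal_context(t=1, r=1, c=0, height=2, width=2)
print(visible)
```

The position being predicted never appears in its own context, which is what makes the prediction autoregressive.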