Insights from Generative Modeling for Neural Video Compression

Ruihan Yang, Yibo Yang, Joseph Marino, Stephan Mandt

TL;DR

This work views recently proposed neural video coding algorithms through the lens of deep autoregressive and latent variable modeling, and proposes several architectures that yield state-of-the-art video compression performance on high-resolution video.

Abstract

While recent machine learning research has revealed connections between deep generative models such as VAEs and rate-distortion losses used in learned compression, most of this work has focused on images. In a similar spirit, we view recently proposed neural video coding algorithms through the lens of deep autoregressive and latent variable modeling. We present these codecs as instances of a generalized stochastic temporal autoregressive transform, and propose new avenues for further improvements inspired by normalizing flows and structured priors. We propose several architectures that yield state-of-the-art video compression performance on high-resolution video and discuss their tradeoffs and ablations. In particular, we propose (i) improved temporal autoregressive transforms, (ii) improved entropy models with structured and temporal dependencies, and (iii) variable bitrate versions of our algorithms. Since our improvements are compatible with a large class of existing models, we provide further evidence that the generative modeling viewpoint can advance the neural video coding field.

Paper Structure

This paper contains 23 sections, 14 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Model diagrams for the generative and inference procedures of the current frame ${\mathbf x}_t$, for various neural video compression methods. Random variables are shown in circles; all other quantities are deterministically computed; solid and dashed arrows describe computational dependency during generation (decoding) and inference (encoding), respectively. Purple nodes correspond to neural encoders (CNNs) and decoders (DCNNs), and green nodes implement the temporal autoregressive transform. (a) TAT; (b) SSF; (c) STAT or STAT-SSF; the magenta box highlights the additional proposed scale transform absent in SSF, and the red arrow from ${\mathbf w}_t$ to ${\mathbf v}_t$ highlights the proposed (optional) structured prior. (d) SSF-TP/SSF-TP+ and (e) STAT-SSF-SP-TP+ illustrate the temporal prior extension based on our proposal; the blue arrow shows the temporal dependency on the previous residual latent ${\mathbf v}_{t-1}$, and the green arrow highlights the improved dependency on the previous reconstructed frame ${\mathbf{\hat{x}}}_{t-1}$.
  • Figure 2: Visualizing the encoding/decoding computation of the STAT-SSF-SP model on one frame of UVG video "Shake-NDry". See Fig. 1(c) for the model's computation diagram. In this example, the warping prediction $\bm{\hat{\mu}}_t$ (bottom, first) incurs a large error around the dog's moving contour but models the mostly static background well, with the residual latents $\lfloor{\mathbf{\bar{v}}}_t\rceil$ taking up an order of magnitude higher bit-rate than $\lfloor{\mathbf{\bar{w}}}_t\rceil$. The proposed scale parameter $\bm{\hat{\sigma}}_t$ (top, second) gives the model extra flexibility when combining the noise ${\mathbf{\hat{y}}}_t$ (bottom, second) with the warping prediction $\bm{\hat{\mu}}_t$ to form the reconstruction ${\mathbf{\hat{x}}}_t = \bm{\hat{\mu}}_t + \bm{\hat{\sigma}}_t \odot {\mathbf{\hat{y}}}_t$ (bottom, fourth). The scale $\bm{\hat{\sigma}}_t$ downweights the contribution from the noise ${\mathbf{\hat{y}}}_t$ in the foreground, where it is very costly, and reduces the residual bit-rate $\mathcal{R}(\lfloor{\mathbf{\bar{v}}}_t\rceil)$ (and thus the overall bit-rate) compared to STAT-SSF and SSF, as illustrated in the third and fourth figures in the top row. The (BPP, PSNR) values for STAT-SSF-SP, STAT-SSF, and SSF (Agustsson et al., 2020) are (0.046, 36.97), (0.053, 36.94), and (0.075, 36.97), respectively. Thus, STAT-SSF and SSF here have reconstruction quality comparable to STAT-SSF-SP but a higher bit-rate.
  • Figure 3: Rate-distortion performance of various models and ablations. Results are evaluated on the (a) UVG and (b) MCL-JCV datasets. All the learning-based models (except VCII (Wu et al., 2018)) are trained on Vimeo-90k. STAT-SSF-SP-TP+ (proposed) achieves the best performance.
  • Figure 4: Qualitative comparisons of various methods on a frame from MCL-JCV video 30. The bottom row shows close-ups of the same image patch from the top row. Here, models with the proposed scale transform (STAT-SSF and STAT-SSF-SP) outperform the ones without, yielding visually more detailed reconstructions at lower rates. The structured prior (STAT-SSF-SP) and temporal prior (STAT-SSF-SP-TP+) reduce the bitrate further.
  • Figure 5: (a) Ablation study of STAT-SSF-SP, examining the effect of two proposed components, STAT (stochastic temporal autoregressive transform) and SP (structured prior), with R-D results evaluated on the UVG dataset. Compared to STAT-SSF-SP, SSF-SP lacks the learned elementwise scaling transform in STAT (Sec. \ref{sec:hybrid-method-stat}), STAT-SSF lacks the structured prior, while SSF (Agustsson et al., 2020) lacks both components. See discussion in Sec. \ref{sec:base-results}. (b) Comparison of the rate-distortion performance between variable-bitrate models and non-variable-bitrate models.
  • ...and 4 more figures
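
The reconstruction rule quoted in the Figure 2 caption, ${\mathbf{\hat{x}}}_t = \bm{\hat{\mu}}_t + \bm{\hat{\sigma}}_t \odot {\mathbf{\hat{y}}}_t$, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `stat_decode` and `stat_encode`, the clamping constant `eps`, and the toy arrays are all assumptions made for illustration. In particular, the encoder-side inverse shown here is only the algebraic inverse of the caption's formula; in the actual codecs the prediction, the scale, and the noise term are produced by neural networks from quantized latents, so the round trip is not exact.

```python
import numpy as np


def stat_decode(mu_hat, sigma_hat, y_hat):
    """Elementwise shift-and-scale reconstruction from the Figure 2 caption:
    x_hat = mu_hat + sigma_hat * y_hat."""
    return mu_hat + sigma_hat * y_hat


def stat_encode(x, mu_hat, sigma_hat, eps=1e-6):
    """Hypothetical algebraic inverse of stat_decode (an assumption, not taken
    from the paper): y = (x - mu_hat) / sigma_hat, with the scale clamped
    away from zero for numerical safety."""
    return (x - mu_hat) / np.maximum(sigma_hat, eps)


# Toy stand-ins for a frame, a warping prediction, and a scale map.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(4, 4, 3))            # "current frame"
mu_hat = x + rng.normal(0.0, 0.05, size=x.shape)     # imperfect prediction
sigma_hat = np.full(x.shape, 0.5)                    # constant scale map

y = stat_encode(x, mu_hat, sigma_hat)
x_rec = stat_decode(mu_hat, sigma_hat, y)

# With sigma fixed to 1 the transform reduces to plain additive residual
# coding, i.e. prediction plus residual with no learned scaling.
x_additive = stat_decode(mu_hat, np.ones_like(mu_hat), x - mu_hat)
```

Without quantization the round trip recovers `x` up to floating-point error. Setting the scale to one everywhere gives a purely additive residual, which is how this sketch mirrors the caption's contrast between models with and without the scale transform.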