Deep Video Codec Control for Vision Models

Christoph Reich, Biplob Debnath, Deep Patel, Tim Prangemeier, Daniel Cremers, Srimat Chakradhar

TL;DR

This paper presents the first end-to-end learnable deep video codec control that considers both bandwidth constraints and downstream deep vision performance while adhering to existing standardization, and demonstrates that this approach better preserves downstream deep vision performance than traditional standard video coding.

Abstract

Standardized lossy video coding is at the core of almost all real-world video processing pipelines. Rate control is used to enable standard codecs to adapt to different network bandwidth conditions or storage constraints. However, standard video codecs (e.g., H.264) and their rate control modules aim to minimize video distortion w.r.t. human quality assessment. We demonstrate empirically that standard-coded videos substantially degrade the performance of deep vision models. To overcome this degradation, this paper presents the first end-to-end learnable deep video codec control that considers both bandwidth constraints and downstream deep vision performance, while adhering to existing standardization. We demonstrate that our approach better preserves downstream deep vision performance than traditional standard video coding.

Paper Structure

This paper contains 48 sections, 9 equations, 26 figures, 7 tables, and 1 algorithm.

Figures (26)

  • Figure 1: H.264 macroblock-wise encoding example. Effect of different macroblock-wise quantization parameters on the visual quality. Video frame (left) is encoded with $\mathrm{QP}$ map (right) as an I-frame. Video data from REDS dataset Nah2019.
  • Figure 2: Vision performance vs. compression. Cityscapes segmentation accuracy and optical flow estimation performance, measured by the average endpoint error (AEPE), for different H.264 quantization parameters between the raw clip predictions (pseudo label) and the coded clip predictions. $\mathrm{QP}$ is applied uniformly.
  • Figure 3: Deep video codec control pipeline. The control network predicts high-dimensional codec parameters for an input clip and a given dynamic network bandwidth (BW) condition. The video clip is encoded using the predicted codec parameters, sent over the network to the server side, decoded, and analyzed by a deep vision model (e.g., a segmentation model). During training, the pre-trained server-side model is fixed, and a differentiable surrogate model of the standard codec is used to propagate gradients from the server-side model and the file size prediction to the control network. During inference, the surrogate model is not used. Video frames from Nah2019.
  • Figure 4: Surrogate model architecture. Our model is composed of a 2D encoder (orange), a 2D decoder (blue), an MHA-based file size head, and three AGRU bottleneck blocks. We use RAFT to compute the optical flows for the AGRU blocks. For embedding the $\mathbf{qp}$ we use an MLP. Skip-connections omitted for simplicity.
  • Figure 5: Control network architecture. We use a pre-trained X3D-S followed by two conditional 3D residual blocks. The $\mathbf{qp}$ one-hot vector is obtained by using the Gumbel-Softmax trick.
  • ...and 21 more figures
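
The caption of Figure 5 states that the $\mathbf{qp}$ one-hot vector is obtained with the Gumbel-Softmax trick. The following is a minimal sketch of that sampling step in plain Python, not the authors' implementation. The function name `gumbel_softmax`, the temperature `tau=0.5`, the uniform logits, and the use of all 52 H.264 QP values (0 to 51) as classes are assumptions made for illustration. The sketch covers only the forward sampling; in a training setup, an autograd framework would typically pass the hard one-hot vector forward and use the soft probabilities for gradients (the straight-through variant).

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw a relaxed categorical sample and its one-hot (argmax) version.

    Hypothetical helper for illustration; returns (soft, hard).
    """
    eps = 1e-12  # keeps the logarithms finite
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)) with u ~ Uniform(0, 1)
        u = min(max(rng.random(), eps), 1.0 - eps)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)

    # Numerically stable softmax over the perturbed, temperature-scaled logits
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]

    # One-hot vector at the most probable class
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return soft, hard


# Assumption: one logit per H.264 QP value (0..51); a control network
# would predict such logits per macroblock.
rng = random.Random(0)
qp_logits = [0.0] * 52
soft, hard = gumbel_softmax(qp_logits, tau=0.5, rng=rng)
qp = hard.index(1.0)  # selected quantization parameter
```

Lower values of `tau` push the soft sample closer to a one-hot vector, at the cost of higher-variance gradients when the soft sample is used for backpropagation.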