
One-Click Upgrade from 2D to 3D: Sandwiched RGB-D Video Compression for Stereoscopic Teleconferencing

Yueyu Hu, Onur G. Guleryuz, Philip A. Chou, Danhang Tang, Jonathan Taylor, Rus Maxham, Yao Wang

TL;DR

This work tackles real-time stereo RGB-D compression for interactive teleconferencing by wrapping a standard video codec with neural pre- and post-processors that are trained through an image codec proxy. It introduces a disparity-warping-based distortion loss and transforms depth maps into world-space coordinates to improve cross-view rendering. Training minimizes the rate-distortion objective $L_{RD} = D + \gamma R$, where $D$ is the distortion and the rate $R$ is estimated through a JPEG proxy. The approach yields substantial bitrate savings (average BD-Rate of about $-29.3\%$ with H.264 and $-27.1\%$ with HEVC) while preserving novel-view rendering quality, and it generalizes from synthetic 4D-people data to real-captured scenes. The method is hardware-friendly, plug-and-play with existing codecs, and scalable to higher resolutions, with potential extensions to relighting and broader 3D video communication scenarios.
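To make the objective $L_{RD} = D + \gamma R$ concrete, here is a minimal sketch of how such a Lagrangian could be evaluated. It is illustrative only: the function name `rd_loss` is hypothetical, the distortion is taken to be plain mean squared error (the paper's distortion is warping-based), and the rate is passed in as a number rather than estimated by a JPEG proxy.

```python
import numpy as np

def rd_loss(original, reconstructed, rate_bits, gamma):
    """Evaluate L_RD = D + gamma * R.

    Assumptions (not from the paper): D is mean squared error and
    R is the given bit count normalized to bits per array element.
    """
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    # Distortion term D: mean squared error between input and output.
    distortion = np.mean((original - reconstructed) ** 2)
    # Rate term R: bits spent per element of the input array.
    rate = rate_bits / original.size
    return distortion + gamma * rate
```

A larger `gamma` penalizes rate more heavily, so sweeping it would trace out different operating points on a rate-distortion curve.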

Abstract

Stereoscopic video conferencing is still challenging due to the need to compress stereo RGB-D video in real time. Though hardware implementations of standard video codecs such as H.264 / AVC and HEVC are widely available, they are not designed for stereoscopic videos and suffer from reduced quality and performance. Specific multiview or 3D extensions of these codecs are complex and lack efficient implementations. In this paper, we propose a new approach to upgrade a 2D video codec to support stereo RGB-D video compression, by wrapping it with a neural pre- and post-processor pair. The neural networks are trained end-to-end with an image codec proxy, and are shown to work with a more sophisticated video codec. We also propose a geometry-aware loss function to improve rendering quality. We train the neural pre- and post-processors on a synthetic 4D people dataset, and evaluate them on both synthetic and real-captured stereo RGB-D videos. Experimental results show that the neural networks generalize well to unseen data and work out of the box with various video codecs. Our approach saves about 30% bit-rate compared to a conventional video coding scheme and MV-HEVC at the same level of rendering quality from a novel view, without the need for a task-specific hardware upgrade.
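As background for the geometry-aware handling of depth, the following sketch shows one common way to lift a depth map to 3D points with a pinhole camera model. It is not the authors' implementation: the function name `depth_to_points` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions for illustration, and the points are expressed in the camera frame (a further rigid transform, omitted here, would be needed for world space).

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W) to camera-space points (H x W x 3).

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy), all in pixels; depth holds Z along the optical axis.
    """
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)
```

Working with 3D coordinates rather than raw depth values lets errors be measured in the space where a novel view is rendered, which is the general motivation behind geometry-aware losses of this kind.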

Paper Structure

This paper contains 29 sections, 4 equations, 11 figures, and 5 tables.

Figures (11)

  • Figure 1: Sandwiched compression scheme for stereo RGB-D video streaming.
  • Figure 2: Neural network architecture of the preprocessor.
  • Figure 3: Warping-based depth distortion loss function.
  • Figure 4: Aggregated rendering rate-distortion curves on the synthetic testing dataset.
  • Figure 5: Aggregated rendering rate-distortion curves on The Relightables dataset.
  • ...and 6 more figures