Low-Latency Neural Stereo Streaming

Qiqi Hou, Farzad Farhadzadeh, Amir Said, Guillaume Sautiere, Hoang Le

TL;DR

This work tackles the latency bottleneck in neural stereo video codecs by replacing sequential inter-view disparity compensation with a parallel, cross-view-aware architecture. It introduces two parallel autoencoder branches for the left and right views and a Bidirectional Shift Module that learns cross-view redundancy, enabling low-latency processing without sacrificing compression efficiency. LLSS delivers substantial BD-rate savings on the CityScapes and KITTI benchmarks and has lower computational complexity than prior methods such as LSVC, which makes it practical for VR and autonomous-vehicle use cases. Because the framework is parallelizable and trainable end to end, it may influence future real-time multi-view compression systems.

Abstract

The rise of new video modalities like virtual reality or autonomous driving has increased the demand for efficient multi-view video compression methods, both in terms of rate-distortion (R-D) performance and in terms of delay and runtime. While most recent stereo video compression approaches have shown promising performance, they compress the left and right views sequentially, leading to poor parallelization and runtime performance. This work presents Low-Latency neural codec for Stereo video Streaming (LLSS), a novel parallel stereo video coding method designed for fast and efficient low-latency stereo video streaming. Instead of the sequential cross-view motion compensation used by existing methods, LLSS introduces a bidirectional feature shifting module to directly exploit mutual information among views and encodes them effectively with a joint cross-view prior model for entropy coding. Thanks to this design, LLSS processes the left and right views in parallel, minimizing latency while substantially improving R-D performance compared to both existing neural and conventional codecs.
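
As a rough illustration of the bidirectional feature shifting idea, the sketch below shifts one view's feature map horizontally by a set of candidate offsets and computes a group-wise correlation against the other view, once in each direction. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the zero-padded integer shifts, the `max_shift` and `num_groups` parameters, and the use of a channel-mean of element-wise products as the group-wise correlation are all hypothetical choices made for illustration.

```python
import numpy as np

def shift_horizontal(feat: np.ndarray, d: int) -> np.ndarray:
    """Shift a (C, H, W) feature map by d pixels along the width axis.

    Positive d moves content to the right, negative d to the left.
    Vacated columns are filled with zeros (an assumption of this sketch).
    """
    out = np.zeros_like(feat)
    w = feat.shape[-1]
    if d == 0:
        out[...] = feat
    elif 0 < d < w:
        out[..., d:] = feat[..., :-d]
    elif -w < d < 0:
        out[..., :d] = feat[..., -d:]
    return out

def groupwise_correlation(a: np.ndarray, b: np.ndarray, num_groups: int) -> np.ndarray:
    """Mean of element-wise products within each channel group: (C, H, W) -> (G, H, W)."""
    c, h, w = a.shape
    if c % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    prod = (a * b).reshape(num_groups, c // num_groups, h, w)
    return prod.mean(axis=1)

def bidirectional_shift_features(left, right, max_shift, num_groups):
    """Correlate each view with shifted copies of the other view, in both directions."""
    for_left, for_right = [], []
    for d in range(max_shift + 1):
        # Left branch: move right-view content rightwards to line up with the left view.
        for_left.append(groupwise_correlation(left, shift_horizontal(right, d), num_groups))
        # Right branch: move left-view content leftwards to line up with the right view.
        for_right.append(groupwise_correlation(right, shift_horizontal(left, -d), num_groups))
    return np.concatenate(for_left, axis=0), np.concatenate(for_right, axis=0)

# Synthetic example: the "right" features are the left features displaced by 3 pixels.
rng = np.random.default_rng(0)
left = rng.standard_normal((8, 4, 16))
right = shift_horizontal(left, -3)
lv, rv = bidirectional_shift_features(left, right, max_shift=4, num_groups=2)
print(lv.shape, rv.shape)
```

Both outputs depend only on the two input feature maps, so the two directions can be computed independently of each other, which is the property that lets the left and right branches run in parallel.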


Paper Structure

This paper contains 16 sections, 4 equations, 8 figures, and 1 table.

Figures (8)

  • Figure 1: Comparison of multi-view compression strategies. In contrast to LSVC [Chen2022-xe], our approach processes the left and right frames simultaneously. This parallel processing not only facilitates more rate-efficient coding but also reduces the latency between the left and right views.
  • Figure 2: Overall architecture of our network. It contains two branches dedicated to processing the left and right views. It incorporates a parallel motion autoencoder and a parallel context autoencoder to reduce the redundant motion and context information across views, respectively. Weights are shared across views for the feature extraction, motion estimation, motion compensation, and image reconstruction modules.
  • Figure 3: Mutual information between cross-view motion latents. $I(\mathbf{Y}^R; \mathbf{Y}^L) = -\tfrac{1}{2}\log_2(1-\rho^2)$ for a joint Gaussian distribution with normalized cross-correlation $\rho$.
  • Figure 4: The architecture of a parallel autoencoder. It contains two parallel branches that compress the left and right features at the same time. Layers are labeled in the format "BlockType(channel, kernel_size, stride)". The Bidirectional Shift Module (BiShiftMod) is designed to learn the correlation between the left and right branches. It shifts the left and right features bidirectionally and estimates the Groupwise Correlation (GroupCor) features and Concatenation-based Correlation (CatCor) features between them. We omit activation layers for conciseness.
  • Figure 5: Rate-distortion curves in terms of PSNR and MS-SSIM on the CityScapes [Cordts2016-de], KITTI 2012 [Geiger2012-cy], and KITTI 2015 [Menze2015-rw] datasets.
  • ...and 3 more figures
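
The Gaussian mutual-information expression quoted in the Figure 3 caption can be evaluated directly to see how quickly shared information grows with cross-view correlation. The snippet below is a small worked example rather than anything taken from the paper: the function name and the sample values of $\rho$ are chosen here for illustration, and the formula is written as $\tfrac{1}{2}\log_2\!\big(1/(1-\rho^2)\big)$, which is algebraically the same expression.

```python
import math

def gaussian_mutual_information(rho: float) -> float:
    """Mutual information in bits between two jointly Gaussian scalars.

    rho is the normalized cross-correlation and must satisfy -1 < rho < 1.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return 0.5 * math.log2(1.0 / (1.0 - rho * rho))

# Illustrative correlation values (not measurements from the paper).
for rho in (0.0, 0.5, 0.9, 0.99):
    bits = gaussian_mutual_information(rho)
    print(f"rho = {rho:.2f}: I = {bits:.3f} bits")
```

Uncorrelated latents share no information, while the value grows without bound as $|\rho|$ approaches 1, which is why strongly correlated left and right latents are worth coding jointly.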