Disparity-based Stereo Image Compression with Aligned Cross-View Priors

Yongqi Zhai, Luyang Tang, Yi Ma, Rui Peng, Ronggang Wang

TL;DR

This work addresses the inefficiency of conventional stereo image compression by leveraging a disparity-based prediction framework. DispSIC jointly trains a disparity estimator and a three-branch auto-encoder (right image, disparity, left residuals) and introduces aligned cross-view priors in a conditional entropy model to better capture cross-view correlations. The approach achieves superior rate-distortion performance on KITTI and InStereo2K, with substantial bitrate savings over prior methods like HESIC while maintaining or improving image quality. The method reduces computational complexity by using a lightweight disparity-based warping and demonstrates effective adaptive bitrate allocation among the three encoded components.

Abstract

With the wide application of stereo images in various fields, research on stereo image compression (SIC) has attracted extensive attention from academia and industry. The core of SIC is to fully exploit the mutual information between the left and right images and to reduce the redundancy between views as much as possible. In this paper, we propose DispSIC, an end-to-end trainable deep neural network, in which we jointly train a stereo matching model to assist the image compression task. Based on the stereo matching results (i.e., disparity), the right image can be easily warped to the left view, and only the residuals between the left and right views are encoded for the left image. DispSIC adopts a three-branch auto-encoder architecture, which encodes the right image, the disparity map, and the residuals, respectively. During training, the whole network learns how to adaptively allocate bitrate among these three parts, achieving better rate-distortion performance at the cost of a lower disparity-map bitrate. Moreover, we propose a conditional entropy model with aligned cross-view priors for SIC, which takes the warped latents of the right image as priors to improve the accuracy of the probability estimation for the left image. Experimental results demonstrate that our proposed method achieves superior performance compared to other existing SIC methods on the KITTI and InStereo2K datasets, both quantitatively and qualitatively.
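
The disparity-based prediction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes rectified, single-channel images stored as nested lists, the common convention that left pixel (y, x) corresponds to right pixel (y, x - d), linear interpolation along each row, and border clamping. The function names `warp_right_to_left`, `residual`, and `reconstruct` are hypothetical.

```python
import math


def warp_right_to_left(right, disparity):
    """Warp a single-channel right image to the left view.

    Assumption (not taken from the paper): left pixel (y, x) matches
    right pixel (y, x - d), where d = disparity[y][x] >= 0.
    """
    height, width = len(right), len(right[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Source column in the right image, clamped to the image border.
            src = min(max(x - disparity[y][x], 0.0), width - 1.0)
            x0 = int(math.floor(src))
            x1 = min(x0 + 1, width - 1)
            w = src - x0
            # Linear interpolation between the two neighbouring columns.
            warped[y][x] = (1.0 - w) * right[y][x0] + w * right[y][x1]
    return warped


def residual(left, prediction):
    """Per-pixel difference that would be coded for the left image."""
    return [[l - p for l, p in zip(l_row, p_row)]
            for l_row, p_row in zip(left, prediction)]


def reconstruct(prediction, res):
    """Decoder side: add the residual back onto the warped prediction."""
    return [[p + r for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(prediction, res)]


# Toy one-row example with a constant disparity of one pixel.
right = [[10.0, 20.0, 30.0, 40.0]]
left = [[12.0, 10.0, 20.0, 30.0]]
disparity = [[1.0, 1.0, 1.0, 1.0]]

prediction = warp_right_to_left(right, disparity)
res = residual(left, prediction)
restored = reconstruct(prediction, res)
```

In this toy case the residual is non-zero only at the left border, where the warp has no valid source pixel. In a lossy codec the prediction would be formed from the decoded right image and decoded disparity so that encoder and decoder stay in sync; the sketch omits quantization entirely.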

Paper Structure (16 sections, 9 equations, 8 figures, 4 tables)

This paper contains 16 sections, 9 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Brief framework of our proposed DispSIC. A three-branch architecture is adopted, which encodes the input right image, the disparity map, and the residuals of the left image, respectively.
  • Figure 2: The overall network architecture of our proposed method (DispSIC). We compress the left and right images jointly, and use the disparity map to explicitly represent the pixel-wise correlation between views to save bitrates. C represents the concatenation operation, Q represents quantization.
  • Figure 3: Visual comparisons of images generated by disparity-based warping and homography-based transformation. The blue box shows that the homography-based prediction image $x_{r{\rightarrow}l}^H$ misses some border pixels. The green boxes show that the "Bicyclist" and "Pole" areas in $x_{r{\rightarrow}l}^H$ misalign with the original texture.
  • Figure 4: Visualization of latent codes of the left and right images. The tick marks represent the coordinates of the pixel.
  • Figure 5: Our conditional entropy model used to encode the quantized latent $\hat{y}_l$. ${ENC}_H$ and ${DEC}_H$ represent the hyperprior encoder and decoder. AE and AD are the arithmetic encoder and arithmetic decoder.
  • ...and 3 more figures
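
The effect of a conditional entropy model such as the one in Figure 5 can be sketched with a toy rate estimate. The sketch assumes a discretized Gaussian likelihood for each quantized latent, a common choice in learned image compression that the text above does not confirm for DispSIC. The means and scales are hand-picked stand-ins for what a hyperprior alone, or a hyperprior plus an aligned cross-view prior, might predict; all names and numbers are hypothetical.

```python
import math


def gaussian_cdf(x, mean, scale):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))


def symbol_bits(symbol, mean, scale):
    """Bits needed for one integer symbol under a discretized Gaussian.

    The probability mass is the Gaussian integrated over
    [symbol - 0.5, symbol + 0.5]; an arithmetic coder spends about
    -log2(p) bits on it.
    """
    p = (gaussian_cdf(symbol + 0.5, mean, scale)
         - gaussian_cdf(symbol - 0.5, mean, scale))
    return -math.log2(max(p, 1e-9))  # floor avoids log(0)


def total_bits(latent, means, scales):
    """Estimated bit cost of a whole latent vector."""
    return sum(symbol_bits(s, m, sc)
               for s, m, sc in zip(latent, means, scales))


# Hypothetical quantized left-view latent.
y_left = [3, -1, 4, 0]

# Weak prior: zero mean and a wide scale for every element.
bits_weak = total_bits(y_left, [0.0] * 4, [2.0] * 4)

# Stronger prior (stand-in for an aligned cross-view prior): the means
# match the latent and the scale is tighter.
bits_aligned = total_bits(y_left, [3.0, -1.0, 4.0, 0.0], [0.5] * 4)
```

With these numbers `bits_aligned` is smaller than `bits_weak`, which is the general mechanism by which a more accurate probability estimate lowers the bitrate. How DispSIC actually parameterizes the distribution is not specified in this section.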