Bidirectional Stereo Image Compression with Cross-Dimensional Entropy Model

Zhening Liu, Xinjie Zhang, Jiawei Shao, Zehong Lin, Jun Zhang

TL;DR

BiSIC, a symmetric bidirectional stereo image compression architecture, outperforms conventional image/video compression standards, as well as state-of-the-art learning-based methods, in terms of both PSNR and MS-SSIM.

Abstract

With the rapid advancement of stereo vision technologies, stereo image compression has emerged as a crucial field that continues to draw significant attention. Previous approaches have primarily employed a unidirectional paradigm, where the compression of one view is dependent on the other, resulting in imbalanced compression. To address this issue, we introduce a symmetric bidirectional stereo image compression architecture, named BiSIC. Specifically, we propose a 3D convolution based codec backbone to capture local features and incorporate bidirectional attention blocks to exploit global features. Moreover, we design a novel cross-dimensional entropy model that integrates various conditioning factors, including the spatial context, channel context, and stereo dependency, to effectively estimate the distribution of latent representations for entropy coding. Extensive experiments demonstrate that our proposed BiSIC outperforms conventional image/video compression standards, as well as state-of-the-art learning-based methods, in terms of both PSNR and MS-SSIM.

Paper Structure (25 sections, 13 equations, 17 figures, 4 tables)


Figures (17)

  • Figure 1: The proposed bidirectional stereo image compression architecture. AE and AD denote the arithmetic encoder and arithmetic decoder used for entropy coding. The backbones of the encoder $\mathbf{E}$, decoder $\mathbf{D}$, hyper encoder $h_a$, and hyper decoder $h_s$ are built from 3D convolutions to model local features. 3DConvTr denotes transposed 3D convolution. To enhance global feature extraction, bidirectional mutual attention blocks are inserted between the 3D convolutional layers. Moreover, a novel cross-dimensional entropy model is used to capture complex inter-view dependencies.
  • Figure 2: Overview of the proposed bidirectional mutual attention block. The blue and green lines represent features extracted from the left view and right view, respectively. The network structures of the residual block, the basic efficient attention unit, and the combine block are illustrated on the right, from top to bottom.
  • Figure 3: Illustration of the proposed symmetric cross-dimensional entropy model. It jointly aggregates the hyperprior (blue), stereo spatial context (red), and stereo channel context (yellow) as conditions for an effective probability distribution estimation. The hyper decoder $h_s$ for the hyperprior is shown in Figure 1. The masked 3D convolution for the spatial context is provided in Figure 4. Moreover, the mutual attention block for the channel context is detailed in Figure 2.
  • Figure 4: (Left) Illustration of masked 3D convolution. The blue region is used as the condition for the probability estimation of the current target depicted in red. The weights of the convolution kernel in blue are valid, while the white region is masked to maintain the causal spatial context. (Middle) Demonstration of the auto-regressive process in the cross-dimensional entropy model. The yellow and red arrows represent the channel context and spatial context, respectively. The spatial context is processed entry by entry. (Right) Demonstration of the stereo-checkerboard in the fast variant, where the blue and red parts are each processed only once.
  • Figure 5: Illustration of the entropy model for the proposed fast variant based on stereo-checkerboard. The blue and red parts represent the stereo anchor and stereo non-anchor, respectively. By concurrently processing two views, the 3D convolution based anchor context network effectively captures the dependency from the stereo anchor.
  • ...and 12 more figures
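
The masked 3D convolution and the stereo-checkerboard described in the captions of Figures 4 and 5 can be pictured with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the causal mask is assumed to be a 2D raster-scan mask replicated across the view (depth) axis so that both views condition on the same already-decoded positions, and the anchor pattern is assumed to alternate with the parity of view + row + column. The captions do not specify either layout, so the actual design in the paper may differ.

```python
import numpy as np


def causal_mask_2d(kh: int, kw: int) -> np.ndarray:
    """Raster-scan causal mask: 1 for positions decoded before the kernel center."""
    mask = np.zeros((kh, kw), dtype=np.float32)
    mask[: kh // 2, :] = 1.0        # all rows above the center row
    mask[kh // 2, : kw // 2] = 1.0  # entries left of the center in the center row
    return mask


def stereo_causal_mask(kd: int, kh: int, kw: int) -> np.ndarray:
    """ASSUMPTION: the same 2D causal mask is reused for every depth (view) slice,
    so the center position of each view is hidden and both views are treated alike."""
    return np.broadcast_to(causal_mask_2d(kh, kw), (kd, kh, kw)).copy()


def stereo_checkerboard(height: int, width: int, views: int = 2) -> np.ndarray:
    """ASSUMPTION: anchors sit where (view + row + col) is even, so the pattern
    of the second view is the complement of the first. True marks an anchor."""
    v, i, j = np.indices((views, height, width))
    return (v + i + j) % 2 == 0


if __name__ == "__main__":
    print(stereo_causal_mask(3, 3, 3)[0])
    print(stereo_checkerboard(4, 4).astype(int))
```

In a masked convolution, a mask of this kind would be multiplied element-wise with the kernel weights before each forward pass. With the checkerboard, the anchor half would be decoded first in a single pass and the non-anchor half in a second pass conditioned on it, which matches the caption's statement that each part is processed only once.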