CAMSIC: Content-aware Masked Image Modeling Transformer for Stereo Image Compression

Xinjie Zhang; Shenyuan Gao; Zhening Liu; Jiawei Shao; Xingtong Ge; Dailan He; Tongda Xu; Yan Wang; Jun Zhang

TL;DR

CAMSIC targets a weakness of existing learned stereo image codecs: their entropy models are borrowed from single-image codecs and capture the spatial-disparity structure of stereo pairs poorly. It replaces them with a content-aware masked image modeling (MIM) Transformer entropy model. Each view is transformed independently, and content-aware tokens generated from the priors let a decoder-free Transformer exchange information bidirectionally between the prior information and the tokens being estimated, so no separate Transformer decoder is needed. The key contributions are the content-aware MIM technique, the decoder-free Transformer entropy model, and a simple encoder-decoder framework that achieves state-of-the-art rate-distortion performance on Cityscapes and InStereo2K with fast encoding and decoding. The results indicate that a stronger entropy model can improve stereo compression efficiency without complex disparity-warped architectures.

Abstract

Existing learning-based stereo image codecs adopt sophisticated transformations with simple entropy models derived from single-image codecs to encode latent representations. However, those entropy models struggle to effectively capture the spatial-disparity characteristics inherent in stereo images, which leads to suboptimal rate-distortion results. In this paper, we propose a stereo image compression framework, named CAMSIC. CAMSIC independently transforms each image to a latent representation and employs a powerful decoder-free Transformer entropy model to capture both spatial and disparity dependencies by introducing a novel content-aware masked image modeling (MIM) technique. Our content-aware MIM facilitates efficient bidirectional interaction between prior information and estimated tokens, which naturally obviates the need for an extra Transformer decoder. Experiments show that our stereo image codec achieves state-of-the-art rate-distortion performance on two stereo image datasets, Cityscapes and InStereo2K, with fast encoding and decoding speed. Code is available at https://github.com/Xinjie-Q/CAMSIC.
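
To make the role of the entropy model concrete, the following minimal sketch shows how Gaussian parameters predicted for one quantized latent value can be turned into an estimated bit cost. It assumes the unit-width discretized Gaussian that is common in learned image compression; the abstract does not state this detail, and the names `gaussian_cdf` and `estimated_bits` are hypothetical, not taken from the CAMSIC code.

```python
import math


def gaussian_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def estimated_bits(y_hat: float, mu: float, sigma: float, eps: float = 1e-9) -> float:
    """Bits needed to code the integer-quantized value y_hat.

    Assumption: the probability of y_hat is the Gaussian mass on the
    unit-width bin [y_hat - 0.5, y_hat + 0.5]. eps avoids log2(0).
    """
    p = gaussian_cdf(y_hat + 0.5, mu, sigma) - gaussian_cdf(y_hat - 0.5, mu, sigma)
    return -math.log2(max(p, eps))


# A mean close to the true value costs fewer bits than a poor prediction.
good = estimated_bits(3.0, mu=3.0, sigma=0.5)
poor = estimated_bits(3.0, mu=0.0, sigma=0.5)
```

Under this assumption, an entropy model that exploits the other view to predict a mean nearer to the actual latent value directly lowers the bit cost, which is the sense in which better modeling of spatial and disparity dependencies improves the rate.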

Paper Structure (16 sections, 8 equations, 10 figures, 2 tables)

Introduction
Related Works
Method
Overview of CAMSIC
Content-irrelevant Encoder-decoder Transformer Entropy Model
Content-aware Decoder-free Transformer Entropy Model
Training and Inference
Experiments
Experimental Setup
Experimental Results
Ablation Study
Conclusion
Acknowledgments
Network Structure
Iterative Decoding Process
...and 1 more sections

Figures (10)

Figure 1: Comparison of two different masked image modeling Transformer entropy models. The [MASK] token is a learnable parameter that is independent of the specific image content. Tokens within the same red rectangle interact bidirectionally. The vanilla content-irrelevant masked image modeling fills the unestimated positions with uniform [MASK] tokens and relies on an inefficient encoder-decoder Transformer to propagate the prior information. In contrast, our content-aware masked image modeling replaces the uninformative [MASK] tokens with the proposed content-aware tokens, which enables an efficient decoder-free Transformer architecture with richer information interaction.
Figure 2: Our overall stereo image compression framework. Given an input image $\boldsymbol{x}_v$, we first encode the input view to produce the latent representation $\boldsymbol{y}_v$ that is quantized to $\boldsymbol{\hat{y}}_v$. To transmit $\boldsymbol{\hat{y}}_v$ with fewer bits, an MIM-based Transformer entropy model is introduced to predict a probability distribution for $\boldsymbol{\hat{y}}_v$ given hyperprior $\boldsymbol{\hat{z}}_v$, and previously stored representations $\boldsymbol{\hat{y}}_{v-1}$. Finally, the quantized representation $\boldsymbol{\hat{y}}_v$ is decoded to the reconstructed view $\boldsymbol{\hat{x}}_v$. When compressing the first view, a learned embedding $\boldsymbol{\hat{y}}_{-1}$ is used. The arithmetic encoder and decoder after the hyperprior encoder are omitted for brevity.
Figure 3: Structures of the content-irrelevant encoder-decoder Transformer entropy model and our content-aware decoder-free Transformer entropy model. HPD denotes the hyperprior decoder and EPM the entropy parameter module. The mask $\boldsymbol{m}_k$ starts at zero and is then updated iteratively by a scheduler-based generator during inference. In each iteration, a sinusoidal scheduler computes the number of tokens $n_k$ to be encoded/decoded. The mask generator then updates $\boldsymbol{m}_k$ by setting to 1 the entries of the $n_k$ tokens with the lowest estimated bitrates, along with those of the previously encoded/decoded tokens. (a) By using $\boldsymbol{m}_k$ to mix the [MASK] token and $\boldsymbol{\hat{y}}_v$, a composite context $\boldsymbol{s}_v^k$ is synthesized and sent to a projection module, where two identity embeddings are added to further distinguish the token types. Both the prior information and the composite context are input into a Transformer decoder to estimate the probability distribution parameters $(\boldsymbol{\mu}_v^{m_k}, \boldsymbol{\sigma}_v^{m_k})$ for the masked tokens. (b) The [MASK] token is replaced by the content-aware tokens $\boldsymbol{c}_v$ to form the composite context $\boldsymbol{u}_v^k$. This context is then input into a projection module with two identity embeddings and a Transformer encoder to output the estimated Gaussian distribution parameters $(\boldsymbol{\mu}_v^{m_k}, \boldsymbol{\sigma}_v^{m_k})$.
Figure 4: Rate-distortion curves of our approach and different baselines on Cityscapes and InStereo2K datasets.
Figure 5: Complexity of learning-based codecs on the InStereo2K dataset. Both encoding and decoding times are measured on an NVIDIA V100 GPU.
...and 5 more figures
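
The iterative mask update described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the caption does not give the scheduler formula, so the cumulative count $N \sin(\pi k / 2K)$ used here is a guess at one plausible sinusoidal schedule, and the names `tokens_per_step` and `update_mask` are hypothetical. In the real model the estimated bitrates would be re-predicted by the Transformer after every step; the sketch keeps them fixed for brevity.

```python
import math
from typing import List


def tokens_per_step(num_tokens: int, num_steps: int) -> List[int]:
    """Hypothetical sinusoidal schedule: n_k for k = 1..num_steps.

    Assumption: the cumulative number of coded tokens after step k is
    round(num_tokens * sin(pi/2 * k / num_steps)), so the counts sum to
    num_tokens and more tokens are coded in the early steps.
    """
    cumulative = [
        round(num_tokens * math.sin(0.5 * math.pi * k / num_steps))
        for k in range(num_steps + 1)
    ]
    return [cumulative[k] - cumulative[k - 1] for k in range(1, num_steps + 1)]


def update_mask(mask: List[int], est_bits: List[float], n_k: int) -> List[int]:
    """Return the next mask: previously coded tokens stay at 1, and the n_k
    not-yet-coded tokens with the lowest estimated bits are set to 1."""
    pending = [i for i, m in enumerate(mask) if m == 0]
    chosen = sorted(pending, key=lambda i: est_bits[i])[:n_k]
    new_mask = list(mask)
    for i in chosen:
        new_mask[i] = 1
    return new_mask


# Toy run with 8 tokens and 4 steps; the mask starts at zero.
bits = [2.5, 0.3, 1.1, 4.0, 0.7, 3.2, 0.1, 1.9]
schedule = tokens_per_step(len(bits), num_steps=4)
mask = [0] * len(bits)
history = []
for n_k in schedule:
    mask = update_mask(mask, bits, n_k)
    history.append(mask)
```

After the last step every entry of the mask is 1, meaning all tokens have been encoded or decoded. Both sides of the codec can run the same loop because the selection depends only on estimated bitrates, which the decoder can compute from information it already holds.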
