CAMSIC: Content-aware Masked Image Modeling Transformer for Stereo Image Compression
Xinjie Zhang, Shenyuan Gao, Zhening Liu, Jiawei Shao, Xingtong Ge, Dailan He, Tongda Xu, Yan Wang, Jun Zhang
TL;DR
CAMSIC addresses the inefficiency of existing stereo image codecs, which rely on entropy models derived from single-image codecs, by introducing a content-aware masked image modeling (MIM) Transformer entropy model. The method encodes each view independently and uses a Transformer entropy model in which content-aware tokens generated from priors enable bidirectional interaction between prior information and estimated tokens, removing the need for a separate Transformer decoder. Key contributions include the content-aware MIM technique, a decoder-free Transformer entropy model, and a simple yet effective encoder-decoder framework that yields state-of-the-art rate-distortion performance on Cityscapes and InStereo2K with fast encoding and decoding. The approach demonstrates the practical impact of advanced entropy modeling for stereo compression, enabling higher efficiency without complex disparity-warped architectures.
Abstract
Existing learning-based stereo image codecs adopt sophisticated transformations with simple entropy models derived from single-image codecs to encode latent representations. However, those entropy models struggle to effectively capture the spatial-disparity characteristics inherent in stereo images, which leads to suboptimal rate-distortion results. In this paper, we propose a stereo image compression framework named CAMSIC. CAMSIC independently transforms each image into a latent representation and employs a powerful decoder-free Transformer entropy model to capture both spatial and disparity dependencies by introducing a novel content-aware masked image modeling (MIM) technique. Our content-aware MIM facilitates efficient bidirectional interaction between prior information and estimated tokens, which naturally obviates the need for an extra Transformer decoder. Experiments show that our stereo image codec achieves state-of-the-art rate-distortion performance on two stereo image datasets, Cityscapes and InStereo2K, with fast encoding and decoding speeds. Code is available at https://github.com/Xinjie-Q/CAMSIC.
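
To make the idea of content-aware tokens concrete, the following is a minimal illustrative sketch, not the CAMSIC implementation (see the repository above for that). It assumes that each latent position has a token derived from prior information, and it contrasts a conventional MIM input, where every not-yet-decoded position receives one shared mask token, with a content-aware input, where each such position receives the prior-derived token for that position. The function names, the token layout, and the toy values are all hypothetical.

```python
from typing import List, Sequence

Token = List[float]


def shared_mask_input(
    latents: Sequence[Token], known: Sequence[bool], mask_token: Token
) -> List[Token]:
    """Conventional MIM input: unknown positions all get the same mask token."""
    if len(latents) != len(known):
        raise ValueError("latents and known must have the same length")
    return [list(y) if k else list(mask_token) for y, k in zip(latents, known)]


def content_aware_input(
    latents: Sequence[Token], priors: Sequence[Token], known: Sequence[bool]
) -> List[Token]:
    """Content-aware input (assumed form): unknown positions get the
    prior-derived token at the same position instead of a shared mask token."""
    if not (len(latents) == len(priors) == len(known)):
        raise ValueError("latents, priors and known must have the same length")
    return [list(y) if k else list(p) for y, p, k in zip(latents, priors, known)]


# Toy data: three latent tokens, only the first already decoded.
latents = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
priors = [[0.9, 2.1], [2.8, 4.2], [5.1, 5.9]]  # hypothetical prior tokens
known = [True, False, False]

baseline_tokens = shared_mask_input(latents, known, mask_token=[0.0, 0.0])
content_tokens = content_aware_input(latents, priors, known)
```

In this sketch the prior information already sits inside the sequence fed to the Transformer, so self-attention alone can mix priors with decoded tokens in both directions. This is one way to read the abstract's claim that an extra Transformer decoder becomes unnecessary.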
