Stereo Image Coding for Machines with Joint Visual Feature Compression

Dengchao Jin, Jianjun Lei, Bo Peng, Zhaoqing Pan, Nam Ling, Qingming Huang

TL;DR

This work introduces SICM, a framework for stereo image coding optimized for machine vision tasks, by learning to compress stereo visual features rather than raw images. The proposed MVSFC-Net combines a stereo feature extractor, a stereo multi-scale feature compression (SMFC) module, and a visual-analysis head for 3D object detection, with a rate-distortion objective that prioritizes task performance. The SMFC module jointly reduces intra-view, inter-view, and cross-scale redundancies to produce compact representations, yielding substantial BD-rate reductions (up to about 81% on $\rm AP_{3D}$ and about 77% on $\rm AP_{BEV}$) compared to MPEG anchors and prior SIC methods, particularly at low bitrates. Ablation confirms the importance of SMFC for performance, and the method achieves favorable encoding/decoding efficiency, indicating strong potential for practical machine-vision-focused stereo coding under bandwidth constraints.
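The BD-rate figures quoted above summarize the average bitrate difference between two codecs at equal task accuracy. As a rough illustration only, the sketch below computes a Bjøntegaard-style BD-rate from four rate-quality points per codec, with quality standing in for a metric such as $\rm AP_{3D}$. The function names (`bd_rate`, `_lagrange`, `_mean_log_rate`) and the four-point restriction are assumptions made for this example; the paper's own evaluation script is not shown in the source and may differ.

```python
import math
from typing import Sequence, Tuple

RDPoint = Tuple[float, float]  # (bitrate, quality), e.g. (bits per pixel, AP in percent)


def _lagrange(xs: Sequence[float], ys: Sequence[float], x: float) -> float:
    """Evaluate the polynomial passing through (xs[i], ys[i]) at x.

    With four points this is the unique cubic through them.
    The xs values must be distinct.
    """
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def _mean_log_rate(points: Sequence[RDPoint], lo: float, hi: float) -> float:
    """Mean of the fitted log-rate curve over the quality interval [lo, hi]."""
    quality = [q for _, q in points]
    log_rate = [math.log(r) for r, _ in points]
    mid = 0.5 * (lo + hi)
    # Simpson's rule is exact for polynomials of degree <= 3,
    # so this equals the analytic mean of the cubic.
    return (_lagrange(quality, log_rate, lo)
            + 4.0 * _lagrange(quality, log_rate, mid)
            + _lagrange(quality, log_rate, hi)) / 6.0


def bd_rate(anchor: Sequence[RDPoint], test: Sequence[RDPoint]) -> float:
    """Average bitrate change of `test` relative to `anchor`, in percent.

    Negative values mean `test` needs fewer bits for the same quality.
    """
    if len(anchor) != 4 or len(test) != 4:
        raise ValueError("this sketch expects exactly four RD points per codec")
    # Compare only over the quality range covered by both curves.
    lo = max(min(q for _, q in anchor), min(q for _, q in test))
    hi = min(max(q for _, q in anchor), max(q for _, q in test))
    if hi <= lo:
        raise ValueError("quality ranges do not overlap")
    diff = _mean_log_rate(test, lo, hi) - _mean_log_rate(anchor, lo, hi)
    return (math.exp(diff) - 1.0) * 100.0
```

As a sanity check on the definition, a codec that halves every bitrate of the anchor at unchanged quality has a BD-rate of about -50%, and a codec compared against itself has a BD-rate of 0%.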

Abstract

2D image coding for machines (ICM) has achieved great success in coding efficiency, while less effort has been devoted to stereo images. To promote the efficiency of stereo image compression (SIC) and intelligent analysis, stereo image coding for machines (SICM) is formulated and explored in this paper. More specifically, a machine vision-oriented stereo feature compression network (MVSFC-Net) is proposed for SICM, in which stereo visual features are extracted, compressed, and transmitted for 3D visual tasks. To efficiently compress stereo visual features in MVSFC-Net, a stereo multi-scale feature compression (SMFC) module is designed to gradually transform sparse stereo multi-scale features into compact joint visual representations by simultaneously removing spatial, inter-view, and cross-scale redundancies. Experimental results show that the proposed MVSFC-Net achieves superior compression efficiency and 3D visual task performance compared with the existing ICM anchors recommended by MPEG and the state-of-the-art SIC method.

Paper Structure

This paper contains 19 sections, 8 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The architecture of the proposed MVSFC-Net. For the stereo images to be encoded, $\left\{I_{L}, I_{R}\right\}$, the stereo feature extraction module is first applied to obtain the stereo multi-scale features $\left\{f_{L}^{i}, f_{R}^{i} \mid i\in\{0,1,2\}\right\}$. These features are then compressed by the proposed stereo multi-scale feature compression module. Finally, the visual analysis module deployed at the service end performs the vision task on the reconstructed stereo multi-scale features $\left\{\hat{f}_{L}^{i}, \hat{f}_{R}^{i} \mid i\in\{0,1,2\}\right\}$.
  • Figure 2: Comparison of rate-distortion curves when the distortion is measured by $\rm AP_{3D}$.
  • Figure 3: Comparison of rate-distortion curves when the distortion is measured by $\rm AP_{BEV}$.
  • Figure 4: Visual comparison of 3D detection results in RGB images and 3D space. The blue and red bounding boxes denote the ground truth and the predicted result, respectively. (a) The 65th left image in the validation set. (b) The 147th left image in the validation set. (c) The 157th left image in the validation set.
  • Figure 5: Ablation Results.
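
The data flow described in the Figure 1 caption (extract three scales of features per view, compress both views jointly, reconstruct, then analyze) can be sketched with toy stand-ins. Everything in the block below is hypothetical and is not the paper's method: `extract` is a simple pair-averaging pyramid rather than a learned extractor, and `encode`/`decode` use uniform quantization with a left-to-right residual as a crude illustration of removing inter-view redundancy, not the learned SMFC module.

```python
from typing import List, Sequence, Tuple

Pyramid = List[List[float]]   # one feature vector per scale i in {0, 1, 2}
Symbols = List[List[int]]     # quantized integers that would be entropy-coded


def extract(view: Sequence[float], scales: int = 3) -> Pyramid:
    """Toy extractor: each scale averages adjacent pairs of the previous one."""
    feats, cur = [], list(view)
    for _ in range(scales):
        feats.append(cur)
        cur = [(cur[k] + cur[k + 1]) / 2 for k in range(0, len(cur) - 1, 2)]
    return feats


def encode(f_l: Pyramid, f_r: Pyramid, step: float) -> Tuple[Symbols, Symbols]:
    """Toy joint encoder (assumption, not SMFC).

    The left features are quantized directly; the right features are coded
    only as the residual against the reconstructed left features.
    """
    sym_l = [[round(v / step) for v in f] for f in f_l]
    sym_r = [[round((r - q * step) / step) for r, q in zip(fr, ql)]
             for fr, ql in zip(f_r, sym_l)]
    return sym_l, sym_r


def decode(sym_l: Symbols, sym_r: Symbols, step: float) -> Tuple[Pyramid, Pyramid]:
    """Rebuild both pyramids; the right view adds its residual to the left."""
    hat_l = [[q * step for q in s] for s in sym_l]
    hat_r = [[(q + d) * step for q, d in zip(sl, sr)]
             for sl, sr in zip(sym_l, sym_r)]
    return hat_l, hat_r


# Synthetic 1-D "views" used purely for illustration.
left = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
right = [1.2, 2.1, 3.3, 3.9, 5.2, 6.1, 6.8, 8.3]
f_l, f_r = extract(left), extract(right)
sym_l, sym_r = encode(f_l, f_r, step=0.5)
hat_l, hat_r = decode(sym_l, sym_r, step=0.5)
# A visual analysis head at the service end would consume hat_l and hat_r here.
```

In this toy setting the reconstruction error of every feature value is bounded by half the quantization step. The residual symbols for the right view are small when the two views are similar, which is the intuition behind coding the views jointly rather than independently.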