Project-and-Fuse: Improving RGB-D Semantic Segmentation via Graph Convolution Networks

Xiaoyan Jiang, Bohan Wang, Xinlong Wan, Shanshan Chen, Hamido Fujita, Hanan Abd. Al Juaid

TL;DR

The paper addresses misalignment and counter-intuitive patches in RGB-D semantic segmentation caused by traditional feature-level fusion. It introduces a Project-and-Fuse framework that performs late fusion guided by texture priors, encodes depth as a three-channel normal map for CNN-friendly 3D feature extraction, and constructs a semantic- and location-aware graph to reason about region relationships via Graph Convolution Networks. A projection matrix with KL-based hard-pixel mining and locality-aware adjacency edges combats Biased-Assignment and Ambiguous-Locality, while a graph-to-image re-projection yields final pixel-wise predictions. Across NYU-DepthV2 and SUN RGB-D, the approach yields consistent performance gains, validating the effectiveness of texture-guided fusion, depth-to-normal encoding, and graph-based relational reasoning for robust RGB-D segmentation. The methodology offers practical benefits by enabling more explainable fusion and efficient depth processing, suitable for indoor scene understanding tasks.

Abstract

Most existing RGB-D semantic segmentation methods focus on feature-level fusion, including complex cross-modality and cross-scale fusion modules. However, these methods may cause misalignment in the feature fusion process and counter-intuitive patches in the segmentation results. Inspired by the popular pixel-node-pixel pipeline, we propose to 1) fuse features from the two modalities in a late-fusion style, during which the geometric feature injection is guided by a texture feature prior; and 2) employ Graph Neural Networks (GNNs) on the fused features to alleviate the emergence of irregular patches by inferring patch relationships. At the 3D feature extraction stage, we argue that traditional CNNs are not efficient enough for depth maps, so we encode the depth map into a normal map, from which CNNs can easily extract object surface tendencies. At the projection matrix generation stage, we find Biased-Assignment and Ambiguous-Locality issues in the original pipeline. Therefore, we propose to 1) adopt the Kullback-Leibler loss to ensure that no important pixel features are missed, which can be viewed as a hard-pixel mining process; and 2) connect regions that are close to each other in both Euclidean space and semantic space with larger edge weights, so that location information is taken into account. Extensive experiments on two public datasets, NYU-DepthV2 and SUN RGB-D, show that our approach consistently boosts the performance of RGB-D semantic segmentation.
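
The pixel-node-pixel idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `project_and_fuse`, the random stand-in weights, summation as the fusion operator, the cosine-similarity-times-Gaussian edge weight, the `sigma` bandwidth, and the residual single-layer graph convolution are all assumptions chosen for illustration. The KL-based loss is a training-time term and is left out here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def project_and_fuse(rgb_feat, geo_feat, coords, num_nodes=4, sigma=0.5, seed=0):
    """Hypothetical pixel -> node -> pixel pass.

    rgb_feat, geo_feat: (N, C) flattened per-pixel features of the two branches.
    coords:             (N, 2) pixel positions, assumed normalized to [0, 1].
    """
    rng = np.random.default_rng(seed)
    _, c = rgb_feat.shape
    # Random matrices stand in for weights that would be learned in practice.
    w_proj = rng.standard_normal((c, num_nodes)) * 0.1
    w_gcn = rng.standard_normal((c, c)) * 0.1

    # Assumption: the projection matrix is computed from texture (RGB)
    # features only, as one reading of "guided by a texture feature prior".
    proj = softmax(rgb_feat @ w_proj, axis=1)        # (N, K), rows sum to 1

    # Late fusion by summation, then pool pixels into region nodes.
    fused = rgb_feat + geo_feat
    mass = proj.sum(axis=0)[:, None] + 1e-8          # (K, 1) soft region sizes
    nodes = (proj.T @ fused) / mass                  # (K, C) node features
    centers = (proj.T @ coords) / mass               # (K, 2) region centers

    # Edge weights grow when regions are close semantically AND spatially.
    unit = nodes / (np.linalg.norm(nodes, axis=1, keepdims=True) + 1e-8)
    semantic = np.clip(unit @ unit.T, 0.0, None)     # non-negative cosine similarity
    dist2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    locality = np.exp(-dist2 / (2.0 * sigma ** 2))   # Gaussian kernel on distance
    adj = semantic * locality
    adj = adj / adj.sum(axis=1, keepdims=True)       # row-normalize

    # One graph-convolution step with a residual connection.
    nodes = nodes + np.maximum(adj @ nodes @ w_gcn, 0.0)

    # Re-project node features back to the pixel grid.
    pixels = proj @ nodes                            # (N, C)
    return proj, adj, pixels
```

In this sketch the same projection matrix is used for pooling and for re-projection, so every pixel receives a mixture of the updated features of the regions it was assigned to.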

Paper Structure

This paper contains 27 sections, 17 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Counter-intuitive cases. Existing segmentation results contain scenes that contradict human common sense. The first row shows a toilet (white) next to the window (purple), and the second row shows a small patch of ceiling emerging in the middle of the shelves.
  • Figure 2: (a) The popular pixel-graph-pixel network pipeline, which is mainly composed of four modules: feature extraction and representation learning, as in most backbones; graph construction based on the learned feature maps; graph representation updating via various graph neural networks; and graph re-projection to transform the graph back to the feature map. (b) The proposed fusion pipeline. We encode the depth map so that 3D features can be better captured by CNNs. The original pipeline is extended to modal fusion so that complementary information can be aggregated.
  • Figure 3: Details of the proposed approach. We first encode the depth map into a normal map, so that the two modalities can be sent into parallel feature extraction branches. The graph construction module takes the two feature maps as its input and outputs the fused graph. Pixels that have similar semantics and localities are marked as a region and assigned to the same node. The similarities between two regions are used to generate the edge weights. Afterward, graph neural networks are adopted to update the node features. Finally, the updated node features are back-projected to the feature map.
  • Figure 4: (a) Illustration of the depth encoding process. We first project the depth map to a point cloud; then, least-squares fitting is adopted to compute the normal vector of each point; finally, we obtain the normal map depicting the object surface normal tendencies. (b) Details of the graph construction process. The graph construction module takes the feature maps from the two modalities as its input and outputs the fused graph, containing the node features and the adjacency matrix. Note that the fusion operation can be a simple summation or concatenation, which is discussed in the ablation study subsection, and there are several options for generating edge weights, each introduced in the edge-weight generation subsection.
  • Figure 5: Illustration and visualization of two types of positional encoding for one layer of the projection matrix. The 2D branch takes the projection matrix as its input, while the 3D branch requires both the projection matrix and the depth map. The outputs are the computed positional encodings, where $x$ and $y$ are the coordinates of the region center in 2D space and $z$ is the average depth value of the current region.
  • ...and 3 more figures
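
The depth encoding described in the Figure 4(a) caption (depth map to point cloud, then least-squares plane fitting per point) could be sketched as follows. This is a minimal illustration rather than the paper's code: the function name `depth_to_normals`, the pinhole intrinsics `fx`, `fy`, `cx`, `cy`, the square `k` x `k` neighborhood, and the choice to orient normals toward the camera are assumptions.

```python
import numpy as np


def depth_to_normals(depth, fx, fy, cx, cy, k=3):
    """Estimate a per-pixel unit normal map from a depth map.

    depth: (H, W) array of depth values.
    fx, fy, cx, cy: assumed pinhole camera intrinsics.
    k: odd neighborhood size used for the plane fit.
    Border pixels without a full neighborhood are left as zeros.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project every pixel to a 3D point in the camera frame.
    points = np.stack(
        [(u - cx) * depth / fx, (v - cy) * depth / fy, depth], axis=-1
    )  # (H, W, 3)

    r = k // 2
    normals = np.zeros((h, w, 3))
    for i in range(r, h - r):
        for j in range(r, w - r):
            patch = points[i - r:i + r + 1, j - r:j + r + 1].reshape(-1, 3)
            centered = patch - patch.mean(axis=0)
            # Least-squares plane normal: eigenvector of the scatter matrix
            # with the smallest eigenvalue (eigh sorts eigenvalues ascending).
            _, vecs = np.linalg.eigh(centered.T @ centered)
            n = vecs[:, 0]
            # Flip the normal so that it faces the camera at the origin.
            if n @ points[i, j] > 0:
                n = -n
            normals[i, j] = n
    return normals
```

To feed the result to a CNN as a three-channel image, the unit normals can be rescaled from [-1, 1] to [0, 1] with `(normals + 1) / 2`. The double loop keeps the sketch readable; a practical version would vectorize the neighborhood gathering.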