
Region-Enhanced Feature Learning for Scene Semantic Segmentation

Xin Kang, Chaoqun Wang, Xuejin Chen

TL;DR

This work tackles the challenge of modeling long-range context in large-scale point clouds for indoor scene semantic segmentation. It introduces REFL-Net, which uses a Region-based Feature Enhancement (RFE) module consisting of Semantic-Spatial Region Extraction (SSRE) and Region Dependency Modeling (RDM) to compute region-level attention and fuse it with point features. The region representation reduces computation to $O(M^2)$ with $M \

Abstract

Semantic segmentation in complex scenes relies not only on object appearance but also on object location and the surrounding environment. Nonetheless, it is difficult to model long-range context in the format of pairwise point correlations due to the huge computational cost for large-scale point clouds. In this paper, we propose using regions as the intermediate representation of point clouds instead of fine-grained points or voxels to reduce the computational burden. We introduce a novel Region-Enhanced Feature Learning Network (REFL-Net) that leverages region correlations to enhance point feature learning. We design a region-based feature enhancement (RFE) module, which consists of a Semantic-Spatial Region Extraction stage and a Region Dependency Modeling stage. In the first stage, the input points are grouped into a set of regions based on their semantic and spatial proximity. In the second stage, we explore inter-region semantic and spatial relationships by employing a self-attention block on region features and then fuse point features with the region features to obtain more discriminative representations. Our proposed RFE module is plug-and-play and can be integrated with common semantic segmentation backbones. We conduct extensive experiments on ScanNetV2 and S3DIS datasets and evaluate our RFE module with different segmentation backbones. Our REFL-Net achieves 1.8% mIoU gain on ScanNetV2 and 1.7% mIoU gain on S3DIS with negligible computational cost compared with backbone models. Both quantitative and qualitative results show the powerful long-range context modeling ability and strong generalization ability of our REFL-Net.
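The second stage described above (self-attention over region features, then fusion with point features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes mean pooling to form region features, a single attention head with hypothetical projection matrices `w_q`, `w_k`, `w_v`, and fusion by concatenation; it also omits any spatial encoding of region positions. The function name `region_enhance` is made up for illustration. The point is that the attention matrix has shape M x M (regions), not N x N (points).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_enhance(point_feats, region_ids, num_regions, w_q, w_k, w_v):
    """point_feats: (N, C) floats; region_ids: (N,) ints in [0, num_regions)."""
    n, c = point_feats.shape
    # Assumption: region features are the mean of their member point features.
    sums = np.zeros((num_regions, c))
    np.add.at(sums, region_ids, point_feats)
    counts = np.bincount(region_ids, minlength=num_regions).astype(float)
    region_feats = sums / np.maximum(counts, 1.0)[:, None]
    # Single-head self-attention over regions: the score matrix is (M, M).
    q, k, v = region_feats @ w_q, region_feats @ w_k, region_feats @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)
    context = attn @ v  # (M, D) region-level context
    # Broadcast each region's context back to its points and concatenate.
    fused = np.concatenate([point_feats, context[region_ids]], axis=1)
    return fused, attn

# Toy usage with random data (all sizes are illustrative).
rng = np.random.default_rng(0)
N, M, C, D = 1000, 8, 16, 16
point_feats = rng.normal(size=(N, C))
region_ids = rng.integers(0, M, size=N)
w_q, w_k, w_v = (rng.normal(size=(C, D)) * 0.1 for _ in range(3))
fused, attn = region_enhance(point_feats, region_ids, M, w_q, w_k, w_v)
print(fused.shape, attn.shape)  # (1000, 32) (8, 8)
```

With N = 1000 points and M = 8 regions, the attention step stores 64 pairwise scores instead of the 1,000,000 that point-wise attention over all pairs would need.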

Paper Structure (23 sections, 6 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Object category confusion in semantic segmentation. In the point cloud (a), the footrest (red circle) belonging to 'Other' category has a similar texture to the 'Chair' (yellow circle), while 'Desk' (orange circle) has a similar geometry to 'Table' (blue circle). With only local feature aggregation, the baseline model predicts wrong categories for the footrest and the desk (b). With our region-based feature enhancement module, the model can well integrate long-range context and make correct category predictions (c). The feature distributions shown in (e) and (f) demonstrate that our method with RFE learns more distinctive point features.
  • Figure 2: Overview of our REFL-Net for point cloud semantic segmentation. With a general segmentation backbone that extracts point features and makes the initial prediction, our RFE module extracts region-level context to enhance the point features for better semantic segmentation. The RFE module consists of Semantic-Spatial Region Extraction (SSRE) and Region Dependency Modeling (RDM). The SSRE module takes the initial predictions as input and separates the point cloud into a set of local regions. The RDM module models the semantic and spatial correlations between regions and concatenates point features and region features for the final prediction.
  • Figure 3: Semantic-spatial region extraction. From the initial semantic segmentation results (a), the point cloud is divided into a number of semantic groups (b). Then, based on the Euclidean distance between points and the uniformly sampled region centers (c), each group is split into fine-grained regions (d).
  • Figure 4: Semantic segmentation results of three indoor scenes in ScanNetV2. (a) The input point clouds. We visualize the semantic segmentation errors (colored points) of MinkowskiNet (Choy et al., 2019) and our REFL-Net in (b) and (c), respectively. Grey points are correctly classified compared with the ground truth (d).
  • Figure 5: Attention map visualization. The query region is marked by the red circles. The two attention maps in our RDM module demonstrate the context of the scene layout (d) and surrounding objects (e). The point-wise attention in Point Transformer (Zhao et al., 2021) only captures contextual information in a local neighborhood (f), leading to incorrect predictions for the middle chairs.
  • ...and 1 more figure
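
The two-step grouping described in the Figure 3 caption (semantic groups from the initial predictions, then a split by Euclidean distance to sampled region centers) can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: centers are drawn uniformly at random from the points of each semantic group, every group gets the same hypothetical budget `regions_per_class`, and the function name `extract_regions` is invented here.

```python
import numpy as np

def extract_regions(xyz, pred_labels, regions_per_class=4, seed=0):
    """xyz: (N, 3) coordinates; pred_labels: (N,) initial class predictions.

    Returns (region_ids, num_regions), with region_ids of shape (N,).
    """
    rng = np.random.default_rng(seed)
    region_ids = np.full(len(xyz), -1, dtype=np.int64)
    next_id = 0
    for label in np.unique(pred_labels):
        # Step 1: semantic group = all points predicted as this class.
        idx = np.flatnonzero(pred_labels == label)
        k = min(regions_per_class, len(idx))
        # Step 2 (assumption): sample k centers uniformly from the group.
        centers = xyz[rng.choice(idx, size=k, replace=False)]
        # Assign each point in the group to its nearest center.
        dists = np.linalg.norm(xyz[idx][:, None, :] - centers[None, :, :], axis=2)
        region_ids[idx] = next_id + dists.argmin(axis=1)
        next_id += k
    return region_ids, next_id

# Toy usage: 500 random points with 3 predicted classes.
rng = np.random.default_rng(0)
xyz = rng.uniform(0.0, 5.0, size=(500, 3))
pred_labels = rng.integers(0, 3, size=500)
region_ids, num_regions = extract_regions(xyz, pred_labels)
print(num_regions, np.bincount(region_ids).min())
```

Because regions never cross a semantic group boundary, every region contains points of a single predicted class, which is what makes the region features semantically coherent before they are passed to the dependency modeling stage.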