MSTA3D: Multi-scale Twin-attention for 3D Instance Segmentation

Duc Dang Trung Tran, Byeongkeun Kang, Yeejin Lee

TL;DR

MSTA3D is a novel 3D instance segmentation framework that leverages multi-scale feature representation, introduces a twin-attention mechanism to capture superpoints effectively, and integrates a box query with a box regularizer, providing a spatial constraint that complements semantic queries.

Abstract

Recently, transformer-based techniques incorporating superpoints have become prevalent in 3D instance segmentation. However, they often encounter an over-segmentation problem, especially noticeable with large objects. Additionally, unreliable mask predictions stemming from superpoint mask prediction further compound this issue. To address these challenges, we propose a novel framework called MSTA3D. It leverages multi-scale feature representation and introduces a twin-attention mechanism to effectively capture them. Furthermore, MSTA3D integrates a box query with a box regularizer, offering a complementary spatial constraint alongside semantic queries. Experimental evaluations on ScanNetV2, ScanNet200 and S3DIS datasets demonstrate that our approach surpasses state-of-the-art 3D instance segmentation methods.

Paper Structure

This paper contains 16 sections, 8 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: The proposed MSTA3D, a 3D instance segmentation framework, tackles existing challenges by leveraging multi-scale feature representation and spatial query/regularizer.
  • Figure 2: The MSTA3D framework for instance segmentation on point clouds.
  • Figure 3: The architecture of the twin-attention-based decoder. The decoder fuses multi-scale features $\mathbf{S}_h$ and $\mathbf{S}_\ell$ and predicts $\mathbf{X}^L$ by refining box queries.
  • Figure 4: The architecture of the box regularizer. The box regularizer predicts positional differences between bounding boxes derived from scene-wise features and those derived from instance-wise features.
  • Figure 5: Comparison of model complexity.
  • ...and 4 more figures