DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding

Xiaoxuan Yu; Hao Wang; Weiming Li; Qiang Wang; Soonyong Cho; Younghun Sung

TL;DR

DOCTR addresses multi-object point cloud scene understanding across three sub-tasks: instance segmentation, pose estimation, and mesh reconstruction. It introduces a Disentangled Object-Centric Transformer whose semantic-geometry disentangled queries (SGDQs) learn semantic and geometric information separately for every object, so that all sub-tasks can be optimized jointly in a unified framework. A hybrid bipartite matching objective and a mask-enhanced box refinement improve training stability and pose accuracy, and the method achieves state-of-the-art results on ScanNet. By unifying the sub-tasks in a single model and exploiting inter-object relations, DOCTR reduces pipeline complexity and is more robust in cluttered scenes.

Abstract

Point scene understanding is the challenging task of processing real-world scene point clouds, with the aim of segmenting each object, estimating its pose, and reconstructing its mesh simultaneously. The recent state-of-the-art method first segments each object and then processes the objects independently, with multiple stages for the different sub-tasks. This leads to a pipeline that is complex to optimize and makes it hard to leverage the relationship constraints between multiple objects. In this work, we propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representation to facilitate learning with multiple objects for the multiple sub-tasks in a unified manner. Each object is represented as a query, and a Transformer decoder is adapted to iteratively optimize all the queries together with their relationships. In particular, we introduce a semantic-geometry disentangled query (SGDQ) design that enables the query features to attend separately to the semantic information and the geometric information relevant to the corresponding sub-tasks. A hybrid bipartite matching module is employed to make full use of the supervision from all the sub-tasks during training. Qualitative and quantitative experimental results demonstrate that our method achieves state-of-the-art performance on the challenging ScanNet dataset. Code is available at https://github.com/SAITPublic/DOCTR.
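To make the idea of matching with supervision from several sub-tasks concrete, the following is a minimal sketch, not the authors' implementation. It assumes a matching cost that is a weighted sum of a class term, a mask IoU term, and a box-center distance term; the cost terms, the weights, and the names `hybrid_cost` and `match` are all hypothetical. For brevity it finds the minimum-cost one-to-one assignment by brute force rather than with the Hungarian algorithm, which is only practical for toy inputs.

```python
from itertools import permutations


def hybrid_cost(pred, gt, weights):
    """Weighted sum of per-sub-task costs for one (prediction, ground truth) pair.

    The three terms below are illustrative assumptions, not the paper's costs.
    """
    # Classification term: 0 if the labels agree, 1 otherwise.
    cls_cost = 0.0 if pred["cls"] == gt["cls"] else 1.0
    # Segmentation term: 1 - IoU over sets of point indices.
    inter = len(pred["mask"] & gt["mask"])
    union = len(pred["mask"] | gt["mask"])
    mask_cost = 1.0 - (inter / union if union else 0.0)
    # Pose term: L1 distance between box centers.
    box_cost = sum(abs(p - g) for p, g in zip(pred["center"], gt["center"]))
    return (weights["cls"] * cls_cost
            + weights["mask"] * mask_cost
            + weights["box"] * box_cost)


def match(preds, gts, weights):
    """Return (total_cost, pairs), where pairs holds (query_index, gt_index).

    Assumes len(preds) >= len(gts); unmatched queries are simply left out.
    """
    cost = [[hybrid_cost(p, g, weights) for g in gts] for p in preds]
    best_total, best_pairs = float("inf"), []
    # Brute force over all one-to-one assignments (toy sizes only).
    for perm in permutations(range(len(preds)), len(gts)):
        total = sum(cost[q][j] for j, q in enumerate(perm))
        if total < best_total:
            best_total = total
            best_pairs = [(q, j) for j, q in enumerate(perm)]
    return best_total, best_pairs


# Three object queries, two ground-truth objects.
preds = [
    {"cls": "chair", "mask": {0, 1, 2}, "center": (0.0, 0.0, 0.0)},
    {"cls": "table", "mask": {5, 6, 7, 8}, "center": (2.0, 0.0, 0.0)},
    {"cls": "chair", "mask": {9}, "center": (5.0, 5.0, 0.0)},
]
gts = [
    {"cls": "table", "mask": {5, 6, 7}, "center": (2.0, 0.1, 0.0)},
    {"cls": "chair", "mask": {0, 1, 2, 3}, "center": (0.0, 0.0, 0.0)},
]
weights = {"cls": 1.0, "mask": 1.0, "box": 1.0}
total, pairs = match(preds, gts, weights)
print(pairs)  # query 1 is matched to the table, query 0 to the chair
```

Because each pair's cost combines all terms, a query that overlaps the right points but carries the wrong class or a distant box is penalized. That is the intuition behind using every sub-task's supervision in the assignment.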

Paper Structure (14 sections, 9 equations, 5 figures, 3 tables)

This paper contains 14 sections, 9 equations, 5 figures, 3 tables.

Introduction
Related Work
Methods
Sparse 3D Backbone
Disentangled Transformer Decoder
Training Design
Mask Enhanced Box Refinement
Experiment
Experimental Settings
Comparisons to the State-of-the-Arts
Ablation Study
Qualitative Comparisons
Conclusion
Acknowledgments

Figures (5)

Figure 1: Point Scene Understanding. With an incomplete point cloud scene as input, our method learns to segment each object instance and reconstruct its complete mesh. Compared to the recent DIMR (Tang et al. 2022), our proposed DOCTR has a compact pipeline and achieves more accurate results in challenging cases such as when multiple objects are in close proximity.
Figure 2: Comparison of the pipelines of DIMR (Tang et al. 2022) and our DOCTR. DIMR first segments each object instance by clustering and then processes the instances independently, with multiple stages for the different sub-tasks. Our DOCTR pipeline is much more compact: it represents each object instance as a query and optimizes for multiple objects and multiple sub-tasks in a unified manner.
Figure 3: Our proposed DOCTR pipeline consists of a backbone, a disentangled Transformer decoder (DTD), a prediction head, and a shape decoder. We propose semantic-geometry disentangled queries (SGDQs) to represent scene objects. The DTD is trained so that the SGDQs attend to multi-scale point-wise features extracted by the sparse 3D U-Net backbone. At inference, each object SGDQ is passed to the prediction head to predict the object's mask, class, box (pose), and shape code. The shape code is input to the shape decoder to reconstruct the object's complete mesh, which is then aligned to the scene using the estimated pose.
Figure 4: Disentangled Transformer Decoder Layer
Figure 5: Qualitative comparison on the ScanNet dataset
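The last step in the Figure 3 caption, aligning a reconstructed mesh to the scene with the estimated pose, can be illustrated with a small sketch. The pose parameterization here is an assumption made for illustration: a box with a center, per-axis size, and a yaw angle about the vertical z axis, applied to a mesh defined in a canonical unit cube centered at the origin. The function name `align_to_scene` is hypothetical and the paper's actual parameterization may differ.

```python
import math


def align_to_scene(vertices, center, size, yaw):
    """Map canonical mesh vertices into scene coordinates.

    Assumed convention: scale by the box size, rotate by `yaw` (radians)
    about the z axis, then translate to the box center.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    aligned = []
    for x, y, z in vertices:
        # Scale the canonical shape to the predicted box extents.
        x, y, z = x * size[0], y * size[1], z * size[2]
        # Rotate in the horizontal plane.
        xr, yr = c * x - s * y, s * x + c * y
        # Translate to the predicted box center.
        aligned.append((xr + center[0], yr + center[1], z + center[2]))
    return aligned


# One vertex on the +x face of the canonical cube, rotated by 90 degrees.
moved = align_to_scene(
    [(0.5, 0.0, 0.0)], center=(1.0, 1.0, 0.0), size=(2.0, 1.0, 1.0), yaw=math.pi / 2
)
print(moved)  # approximately [(1.0, 2.0, 0.0)]
```

Scaling before rotating keeps the box extents tied to the object's own axes, so a long table stays long along its own length regardless of how it is oriented in the room.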
