2D-3D Interlaced Transformer for Point Cloud Segmentation with Scene-Level Supervision

Cheng-Kun Yang, Min-Hung Chen, Yung-Yu Chuang, Yen-Yu Lin

TL;DR

This work introduces MIT, a Multimodal Interlaced Transformer designed for weakly supervised point cloud segmentation using scene-level tags. MIT uses two transformers (one for 3D voxels and one for 2D multi-view images) and an interlaced decoder that alternately treats 3D tokens as queries and 2D tokens as queries to achieve implicit 2D-3D fusion without camera poses or depth maps. A contrastive loss aligns class tokens across modalities, and pseudo-labels enable end-to-end training under weak supervision. Evaluations on ScanNet and S3DIS show MIT outperforms existing scene-level and 2D-3D fusion baselines, demonstrating effective fusion of texture-rich 2D information with geometric 3D structure for improved segmentation. The approach offers a scalable, pose-free pathway to leverage multimodal data in large-scale 3D understanding tasks with minimal annotation burden.

Abstract

We present a Multimodal Interlaced Transformer (MIT) that jointly considers 2D and 3D data for weakly supervised point cloud segmentation. Prior studies have shown that 2D and 3D features are complementary for point cloud segmentation. However, existing methods require extra 2D annotations to achieve 2D-3D information fusion. Considering the high annotation cost of point clouds, effective 2D and 3D feature fusion based on weakly supervised learning is in great demand. To this end, we propose a transformer model with two encoders and one decoder for weakly supervised point cloud segmentation using only scene-level class tags. Specifically, the two encoders compute the self-attended features for 3D point clouds and 2D multi-view images, respectively. The decoder implements interlaced 2D-3D cross-attention and carries out implicit 2D and 3D feature fusion. We alternately switch the roles of queries and key-value pairs in the decoder layers, so that the 2D and 3D features are iteratively enriched by each other. Experiments show that MIT outperforms existing weakly supervised point cloud segmentation methods by a large margin on the S3DIS and ScanNet benchmarks. The project page will be available at https://jimmy15923.github.io/mit_web/.

Paper Structure (43 sections, 4 equations, 7 figures, 12 tables)

Figures (7)

  • Figure 1: Overview of the Multimodal Interlaced Transformer (MIT). The input includes a 3D point cloud, multi-view 2D images, and class-level tags of a scene. Our method is a transformer model with two encoders and one decoder. The two encoders compute features for 3D voxel tokens and 2D view tokens, respectively. The decoder conducts interlaced 2D-3D attention and carries out 2D and 3D feature fusion. In its odd layers, 3D voxels serve as queries and are enriched by the semantic features of 2D views, acting as key-value pairs. In the even layers, the roles of 3D voxels and 2D views switch: 2D views are described by additional 3D geometric features.
  • Figure 2: An overview of our Multimodal Interlaced Transformer (MIT) for weakly supervised point cloud segmentation. It is a transformer-based model with two encoders, $\tilde{f}_{\text{3D}}$ and $\tilde{f}_{\text{2D}}$, for modality-specific feature extraction and one decoder, $f_d$, for feature fusion. The 2D and 3D pooled features, $\hat{s}_\text{2D}$ and $\hat{s}_\text{3D}$, are added to their respective learnable position embeddings ($\hat{z}_\text{2D}$ and $\hat{z}_\text{3D}$), prepended with the class tokens, and passed through the encoders to obtain the self-attended features $F_{\text{2D}}$ and $F_{\text{3D}}$. The predicted class scores for each modality are obtained through average pooling and class-aware layers.
  • Figure 3: The architecture of an interlaced block. The multilayer perceptron with residual learning is omitted from the figure for simplicity but is used in the block.
  • Figure 4: Qualitative results on the ScanNet dataset with scene-level supervision. The colored boxes highlight the differences between our MIT and MIT with 3D data only, and their corresponding views are shown on the right with outlines of the same color. For each view, the tags at the top indicate the results of the multi-label classification.
  • Figure 5: Network architecture of our MIT extension with camera poses and depth maps.
  • ...and 2 more figures
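
The role switching described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes single-head attention with a residual connection and leaves out the MLP, normalization, class tokens, and position embeddings. The names `cross_attention`, `interlaced_decoder`, and `layer_weights`, along with all sizes, are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values, w_q, w_k, w_v):
    # Single-head cross-attention; the residual connection is an assumption.
    q = queries @ w_q
    k = keys_values @ w_k
    v = keys_values @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return queries + softmax(scores) @ v


def interlaced_decoder(tokens_3d, tokens_2d, layer_weights):
    # Odd layers (1, 3, ...): 3D tokens are queries, 2D tokens are key-value pairs.
    # Even layers (2, 4, ...): the roles switch.
    for layer_idx, (w_q, w_k, w_v) in enumerate(layer_weights, start=1):
        if layer_idx % 2 == 1:
            tokens_3d = cross_attention(tokens_3d, tokens_2d, w_q, w_k, w_v)
        else:
            tokens_2d = cross_attention(tokens_2d, tokens_3d, w_q, w_k, w_v)
    return tokens_3d, tokens_2d


# Toy inputs: 5 voxel tokens and 3 view tokens with 8 channels (illustrative sizes).
rng = np.random.default_rng(0)
dim, num_voxels, num_views, num_layers = 8, 5, 3, 4
tokens_3d = rng.normal(size=(num_voxels, dim))
tokens_2d = rng.normal(size=(num_views, dim))
layer_weights = [
    tuple(rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))
    for _ in range(num_layers)
]

fused_3d, fused_2d = interlaced_decoder(tokens_3d, tokens_2d, layer_weights)
print(fused_3d.shape, fused_2d.shape)
```

Because each layer updates only the modality that acts as the query, the token counts of both modalities stay fixed, and no explicit 2D-3D correspondence (camera poses or depth maps) is needed to compute the attention weights.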