ODIN: A Single Model for 2D and 3D Segmentation

Ayush Jain, Pushkal Katara, Nikolaos Gkanatsios, Adam W. Harley, Gabriel Sarch, Kriti Aggarwal, Vishrav Chaudhary, Katerina Fragkiadaki

TL;DR

ODIN introduces a unified transformer-based architecture that can perform both 2D image and 3D point-cloud instance segmentation by interleaving 2D within-view fusion with 3D cross-view fusion. It differentiates 2D and 3D tokens via distinct positional encodings and employs a 2D-to-3D unprojection and 3D-to-2D projection, along with a $k$-NN Transformer using relative 3D positions, to fuse information across views. The model leverages pre-trained 2D backbones, supports open-vocabulary class decoding for multi-dataset training, and achieves state-of-the-art results on ScanNet200, Matterport3D, and AI2THOR, particularly when using sensor RGB-D inputs directly rather than mesh-derived point clouds. ODIN’s ablations demonstrate the importance of cross-view fusion, joint 2D-3D training, and strong 2D pre-training, suggesting a promising path for embodied vision where a single architecture can handle diverse perception tasks with real sensor data.
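
The 2D-to-3D unprojection and 3D-to-2D projection mentioned above can be illustrated with a standard pinhole camera model. The following is a minimal sketch, not the authors' implementation: the function names `unproject` and `project`, the camera-to-world pose convention, and the toy intrinsics are all assumptions made for illustration.

```python
import numpy as np


def unproject(depth, K, cam_to_world):
    """Lift each pixel of a depth map to a 3D point in world coordinates.

    depth:        (H, W) depth along the camera z-axis
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera pose (camera-to-world convention assumed)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # u = column, v = row
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)  # (H, W, 4)
    return (pts_cam @ cam_to_world.T)[..., :3]


def project(points_world, K, cam_to_world):
    """Map world points back to pixel coordinates (u, v) of one camera."""
    ones = np.ones_like(points_world[..., :1])
    homo = np.concatenate([points_world, ones], axis=-1)
    pts_cam = (homo @ np.linalg.inv(cam_to_world).T)[..., :3]
    uvw = pts_cam @ K.T  # perspective projection before the divide
    return uvw[..., :2] / uvw[..., 2:3]


# Toy camera (hypothetical values): identity rotation, translated pose.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.5], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 2.0, 3.0]
depth = np.full((3, 4), 2.0)

xyz = unproject(depth, K, pose)  # (3, 4, 3) world-space token positions
uv = project(xyz, K, pose)       # (3, 4, 2) recovers the pixel grid
```

Because each 3D point here originates from exactly one pixel, projecting it back with the same camera recovers that pixel's coordinates, which is why the round trip serves as a simple sanity check.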

Abstract

State-of-the-art models on contemporary 3D segmentation benchmarks like ScanNet consume and label dataset-provided 3D point clouds, obtained through post-processing of sensed multiview RGB-D images. They are typically trained in-domain, forgo large-scale 2D pre-training, and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that consume posed images versus post-processed 3D point clouds has fueled the belief that 2D and 3D perception require distinct model architectures. In this paper, we challenge this view and propose ODIN (Omni-Dimensional INstance segmentation), a model that can segment and label both 2D RGB images and 3D point clouds, using a transformer architecture that alternates between 2D within-view and 3D cross-view information fusion. Our model differentiates 2D and 3D feature operations through the positional encodings of the tokens involved, which capture pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art performance on the ScanNet200, Matterport3D and AI2THOR 3D instance segmentation benchmarks, and competitive performance on ScanNet, S3DIS and COCO. It outperforms all previous works by a wide margin when the sensed 3D point cloud is used in place of the point cloud sampled from the 3D mesh. When used as the 3D perception engine in an instructable embodied agent architecture, it sets a new state of the art on the TEACh action-from-dialogue benchmark. Our code and checkpoints can be found at the project website (https://odin-seg.github.io).

Paper Structure

This paper contains 23 sections, 1 equation, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: Omni-Dimensional INstance segmentation (ODIN) is a model that can parse either a single RGB image or a multiview posed RGB-D sequence into 2D or 3D labelled object segments respectively. Left: Given a posed RGB-D sequence as input, ODIN alternates between a within-view 2D fusion and a cross-view 3D fusion. When the input is a single RGB image, the 3D fusion layers are skipped. ODIN shares the majority of its parameters across both RGB and RGB-D inputs, enabling the use of pre-trained 2D backbones. Right: At each 2D-to-3D transition, ODIN unprojects 2D feature tokens to their 3D locations using sensed depth and camera intrinsics and extrinsics.
  • Figure 2: ODIN Architecture: The input to our model is either a single RGB image or a multiview RGB-D posed sequence. We feed them to ODIN's backbone which interleaves 2D within-view fusion layers and 3D cross-view attention layers to extract feature maps of different resolutions (scales). These feature maps exchange information through a multi-scale attention operation. Additional 3D fusion layers are used to improve multiview consistency. Then, a mask decoder head is used to initialize and refine learnable slots that attend to the multi-scale feature maps and predict object segments (masks and semantic classes).
  • Figure 3: Variation in 2D mAP performance as the number of context views increases
  • Figure 4: Detailed ODIN Architecture Components: Left: the 3D RelPos Attention module, which takes as input the depth, camera parameters and feature maps from all views and lifts the features to 3D to obtain 3D tokens. Each 3D token serves as a query, and its K nearest neighbors become the corresponding keys and values. The 3D tokens attend to their neighbors for L layers and update themselves. Finally, the 3D tokens are mapped back to 2D by simply reshaping the 3D feature cloud into multi-view 2D feature maps. Middle: the query refinement block, where queries first attend to the text tokens, then to the visual tokens, and finally undergo self-attention. The text features are optional and are used only in the open-vocabulary decoder setup. Right: the segmentation mask decoder head, where the queries take a dot product with the visual tokens to decode the segmentation heatmap, which can be thresholded to obtain the segmentation mask. In the open-vocabulary decoding setup, the queries also take a dot product with the text tokens to decode a distribution over individual words. In the closed-vocabulary decoding setup, queries instead pass through an MLP to predict a distribution over classes.
  • Figure 5: Qualitative Results on various 3D and 2D datasets
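
The 3D RelPos Attention and mask decoding steps described in the Figure 4 caption can be sketched in a few lines of NumPy. This is an illustrative single-head, single-layer sketch under stated assumptions, not the paper's implementation: the function names, the linear embedding of relative positions (`Wpos`), adding that embedding to both keys and values, the residual update, the brute-force neighbor search, and the sigmoid with a 0.5 threshold are all hypothetical choices.

```python
import numpy as np


def knn_indices(xyz, k):
    """Indices of the k nearest tokens for each token (brute force, O(N^2))."""
    d2 = ((xyz[:, None, :] - xyz[None, :, :]) ** 2).sum(-1)  # (N, N)
    return np.argsort(d2, axis=1)[:, :k]  # each token is its own first neighbor


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def knn_relpos_attention(feats, xyz, k, Wq, Wk, Wv, Wpos):
    """Each 3D token (query) attends to its k nearest neighbors (keys, values)."""
    idx = knn_indices(xyz, k)             # (N, k)
    rel = xyz[idx] - xyz[:, None, :]      # (N, k, 3) relative 3D positions
    pos = rel @ Wpos                      # (N, k, C) assumed linear embedding
    q = feats @ Wq                        # (N, C)
    keys = feats[idx] @ Wk + pos          # (N, k, C)
    vals = feats[idx] @ Wv + pos          # (N, k, C)
    logits = np.einsum("nc,nkc->nk", q, keys) / np.sqrt(q.shape[-1])
    out = np.einsum("nk,nkc->nc", softmax(logits), vals)
    return feats + out                    # residual update (assumption)


def decode_masks(queries, tokens, threshold=0.5):
    """Dot product of queries with visual tokens, squashed and thresholded."""
    heat = 1.0 / (1.0 + np.exp(-(queries @ tokens.T)))  # (Q, N) heatmap
    return heat > threshold


# Toy multi-view input: V views of H x W feature tokens with 3D positions.
rng = np.random.default_rng(0)
V, H, W, C, k = 2, 4, 5, 8, 6
feat_maps = rng.normal(size=(V, H, W, C))
xyz = rng.uniform(size=(V, H, W, 3))
Wq, Wk, Wv = (rng.normal(size=(C, C)) * 0.1 for _ in range(3))
Wpos = rng.normal(size=(3, C)) * 0.1

tokens = feat_maps.reshape(-1, C)         # lift: one 3D token per pixel
fused = knn_relpos_attention(tokens, xyz.reshape(-1, 3), k, Wq, Wk, Wv, Wpos)
fused_maps = fused.reshape(V, H, W, C)    # back to 2D by reshaping

queries = rng.normal(size=(3, C))         # three hypothetical object queries
masks = decode_masks(queries, fused).reshape(3, V, H, W)
```

Since only relative positions enter the attention, translating the whole scene leaves the fused features unchanged. Mapping back to 2D needs no explicit projection in this sketch because every 3D token keeps the index of the pixel it was lifted from.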