Cubify Anything: Scaling Indoor 3D Object Detection

Justin Lazarow, David Griffiths, Gefen Kohavi, Francisco Crespo, Afshin Dehghan

TL;DR

The paper tackles indoor 3D object detection from a single RGB(-D) frame by building CA-1M, a large-scale, pixel-perfect dataset with exhaustively labeled 3D boxes rendered to every frame. It introduces Cubify Transformer (CuTR), a fully Transformer-based detector that operates on RGB(-D) inputs without 3D inductive biases, and demonstrates that with CA-1M pretraining CuTR can outperform traditional point-based 3D detectors on diverse benchmarks. Notably, CuTR benefits from large-scale, high-fidelity data and pretraining, achieving strong recall and precision, while also showing robustness to depth noise and maintaining competitive RGB-only performance. The work argues for a data-centric, image-based approach to scaling indoor 3D perception and introduces a practical rendering pipeline to align world-space annotations with frame-level ground truth, enabling effective supervision at scale.

Abstract

We consider indoor 3D object detection with respect to a single RGB(-D) frame acquired from a commodity handheld device. We seek to significantly advance the status quo with respect to both data and modeling. First, we establish that existing datasets have significant limitations to scale, accuracy, and diversity of objects. As a result, we introduce the Cubify-Anything 1M (CA-1M) dataset, which exhaustively labels over 400K 3D objects on over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. Next, we establish Cubify Transformer (CuTR), a fully Transformer 3D object detection baseline which rather than operating in 3D on point or voxel-based representations, predicts 3D boxes directly from 2D features derived from RGB(-D) inputs. While this approach lacks any 3D inductive biases, we show that paired with CA-1M, CuTR outperforms point-based methods - accurately recalling over 62% of objects in 3D, and is significantly more capable at handling noise and uncertainty present in commodity LiDAR-derived depth maps while also providing promising RGB only performance without architecture changes. Furthermore, by pre-training on CA-1M, CuTR can outperform point-based methods on a more diverse variant of SUN RGB-D - supporting the notion that while inductive biases in 3D are useful at the smaller sizes of existing datasets, they fail to scale to the data-rich regime of CA-1M. Overall, this dataset and baseline model provide strong evidence that we are moving towards models which can effectively Cubify Anything.

Paper Structure

This paper contains 24 sections, 8 figures, 10 tables.

Figures (8)

  • Figure 1: The Cubify Anything 1M (CA-1M) dataset re-imagines ARKitScenes (Baruch et al., 2021) by annotating 3D boxes for objects in a near-exhaustive, class-agnostic manner for over 1K of the laser-scanned scenes, which have been registered to over 3000 iPad Pro RGB-D captures. We show the richness of these annotations from the perspective of a stationary FARO laser scanner in the panorama (top). The CA-1M dataset subsequently uses the registration to render the annotations in a pixel-accurate manner to every frame in the capture, as shown in the selected frames of the second row, to produce over 15 million frames capturing over 440K objects.
  • Figure 2: CA-1M is the first dataset to provide explicit 3D boxes which cover the full richness of objects while being both spatially accurate and pixel-perfect with respect to each frame. Existing datasets like SUN RGB-D, ScanNet v2, and ARKitScenes are either small, coarsely labeled, or lack accurate mappings from world to image space. Since ARKitScenes and CA-1M are labeled on the same underlying data, we can show the effect of exhaustive labeling.
  • Figure 3: The CA-1M annotation tool targets robustly labeling 3D boxes for any object on high-resolution FARO point clouds. Multi-view projection of annotations to supporting images allows for accurate and reliable annotation even when the laser scans include only partial scans of objects, like those on the shelves and to the left of the desk in the accompanying image.
  • Figure 4: Per-frame ground-truth is determined by projecting world-space 3D boxes to each frame (left) and uses rendering to determine a coarse "instance mask" (middle) which can be used to filter and cut boxes to reflect the frame's visibility and occlusion characteristics (right).
  • Figure 5: While ScanNet++ is also labeled on FARO scans, it does not explicitly label 3D boxes, instead labeling instance segmentation on the FARO meshes. This presents a mismatch where objects are observed on the corresponding RGB capture but not in the underlying FARO scan. We show renderings of the FARO mesh (left) versus the RGB capture (right) where black regions correspond to missing regions. While CA-1M suffers from the same inherent limitation of stationary laser scanners, its explicit annotation of 3D boxes still allows for annotation of objects using multi-view image support, as seen in Figure 3.
  • ...and 3 more figures
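
The caption of Figure 4 describes per-frame ground truth as starting from a projection of world-space 3D boxes into each frame. The sketch below illustrates only that projection step, under simplifying assumptions that are not taken from the paper: boxes are axis-aligned in world space, the camera is an ideal pinhole, and corners behind the camera are simply dropped. All function names and parameters are hypothetical, and the rendering-based instance mask used for occlusion filtering and box cutting is not modeled.

```python
from itertools import product


def box_corners(center, size):
    """Return the 8 corners of an axis-aligned 3D box (assumption: no rotation)."""
    cx, cy, cz = center
    sx, sy, sz = size
    return [
        (cx + dx * sx / 2, cy + dy * sy / 2, cz + dz * sz / 2)
        for dx, dy, dz in product((-1, 1), repeat=3)
    ]


def world_to_camera(point, rotation, translation):
    """Apply a rigid world-to-camera transform: p_cam = R * p_world + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection; returns None for points at or behind the camera plane."""
    x, y, z = point_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def project_box(center, size, rotation, translation, intrinsics, image_size):
    """Project a world-space box into a frame.

    Returns the 2D extent (u_min, v_min, u_max, v_max) clipped to the image,
    or None if no part of the projected extent falls inside the frame.
    Dropping corners behind the camera is a crude approximation.
    """
    fx, fy, cx, cy = intrinsics
    width, height = image_size
    pixels = []
    for corner in box_corners(center, size):
        uv = project(world_to_camera(corner, rotation, translation), fx, fy, cx, cy)
        if uv is not None:
            pixels.append(uv)
    if not pixels:
        return None
    us, vs = zip(*pixels)
    u0, u1 = max(min(us), 0.0), min(max(us), float(width))
    v0, v1 = max(min(vs), 0.0), min(max(vs), float(height))
    if u0 >= u1 or v0 >= v1:
        return None
    return (u0, v0, u1, v1)


# Example: a 2 m cube centered 4 m in front of a camera at the world origin.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
extent = project_box(
    center=(0.0, 0.0, 4.0),
    size=(2.0, 2.0, 2.0),
    rotation=IDENTITY,
    translation=(0.0, 0.0, 0.0),
    intrinsics=(500.0, 500.0, 320.0, 240.0),
    image_size=(640, 480),
)
```

A visibility test based only on the projected extent cannot tell whether a box is hidden behind another object, which is presumably why the caption pairs projection with a rendered instance mask.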